FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control

Computer Vision & MultiModal AI
arXiv: 2602.13185v1
Authors

Mingzhi Sheng, Zekai Gu, Peng Li, Cheng Lin, Hao-Xiang Guo, Ying-Cong Chen, Yuan Liu

Abstract

Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that a fundamental disentanglement of "appearance" and "motion" provides a more robust and scalable pathway. We propose FlexAM, a unified framework built upon a novel 3D control signal. This signal represents video dynamics as a point cloud, introducing three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.
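The abstract describes representing video dynamics as a point cloud but does not spell out how the points are obtained. A common recipe for this kind of 3D signal is to back-project per-frame depth maps through the camera intrinsics; the sketch below illustrates that standard lifting step (the function name and pinhole-intrinsics assumption are illustrative, not taken from the paper).

```python
import numpy as np

def lift_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud in the camera frame.

    depth      : (H, W) array of per-pixel depth values.
    fx, fy     : focal lengths of an assumed pinhole camera.
    cx, cy     : principal point of the assumed pinhole camera.
    Returns an (H*W, 3) array of (x, y, z) points.
    """
    h, w = depth.shape
    # Pixel coordinate grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    # Standard pinhole back-projection of each pixel to a 3D point.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A flat depth map at distance 1 yields points on the plane z = 1.
pts = lift_to_point_cloud(np.ones((2, 2)), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(pts.shape)  # (4, 3)
```

Repeating this per frame produces a time-indexed ("dynamic") point cloud, which is the kind of structure the paper's control signal is built on.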

Paper Summary

Problem
Effective and generalizable control in video generation remains a significant challenge in the field of computer vision. Current methods often rely on ambiguous or task-specific signals, which can limit their scalability and robustness. Researchers have explored decomposing videos into various elemental signals, but these approaches can be inefficient and require bespoke training data and model designs.
Key Innovation
The proposed solution, FlexAM, introduces a novel 3D control signal that represents video dynamics as a dynamic point cloud. This signal is enhanced with multi-frequency positional encoding, depth-aware positional encoding, and a flexible control signal. FlexAM effectively disentangles appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing.
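The paper does not provide implementation details for its multi-frequency positional encoding, but encodings of this kind typically follow the NeRF-style sinusoidal scheme: each point coordinate is mapped through sines and cosines at geometrically spaced frequencies, so nearby points that are indistinguishable at low frequencies separate at high ones. The sketch below shows that standard scheme as a plausible reading (the function name and frequency schedule are assumptions, not from the paper).

```python
import numpy as np

def multi_freq_encode(points, num_freqs=6):
    """NeRF-style multi-frequency sinusoidal encoding of 3D points.

    points    : (N, 3) array of (x, y, z) coordinates, roughly in [-1, 1].
    num_freqs : number of octaves; higher frequencies resolve finer motion.
    Returns an (N, 3 * 2 * num_freqs) feature array.
    """
    freqs = 2.0 ** np.arange(num_freqs)            # 1, 2, 4, ... (octaves)
    scaled = points[:, :, None] * freqs * np.pi    # (N, 3, num_freqs)
    # Sin and cos at every frequency, per coordinate.
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(points.shape[0], -1)

feat = multi_freq_encode(np.array([[0.1, -0.3, 0.8]]))
print(feat.shape)  # (1, 36)
```

The depth-aware variant described in the paper would presumably condition this encoding on the z (depth) channel differently from x and y, but the paper summary gives no specifics, so that is left out here.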
Practical Impact
FlexAM has the potential to significantly advance controllable video generation. By providing a unified framework for controlling video generation, FlexAM can enable a wide range of applications, including:

* **Video editing**: changing the motion of objects or the camera trajectory in existing footage.
* **Virtual reality**: creating more realistic and interactive virtual experiences.
* **Film and television production**: producing more realistic and engaging special effects.
Analogy / Intuitive Explanation
Think of FlexAM as a 3D map of a city, where each point on the map represents a specific location in space and time. The multi-frequency positional encoding is like a high-resolution map that shows the exact location of each point, while the depth-aware positional encoding is like a layer of shading that indicates the distance of each point from the viewer. The flexible control signal is like a set of traffic lights that control the flow of traffic on the map, allowing the model to precisely control the motion and trajectories of elements during generation and editing.
Paper Information

Categories: cs.CV, cs.GR
Published Date:
arXiv ID: 2602.13185v1