FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control

Mingzhi Sheng; Zekai Gu; Peng Li; Cheng Lin; Hao-Xiang Guo; Ying-Cong Chen; Yuan Liu

TL;DR

FlexAM tackles controllable video generation by introducing a unified appearance-motion decomposition powered by a novel 3D control signal shaped as a dynamic point cloud. The motion signal combines multi-frequency positional encoding and depth-aware features, while appearance conditioning operates on arbitrarily masked videos, enabling broad I2V/V2V editing scenarios. Integrated into a diffusion-based generator with density-aware training, FlexAM demonstrates state-of-the-art performance across appearance editing, camera control, and spatial object editing. This approach offers robust, 3D-aware video generation with flexible control that scales to diverse editing tasks and camera manipulations.

Abstract

Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that a fundamental disentanglement of "appearance" and "motion" provides a more robust and scalable pathway. We propose FlexAM, a unified framework built upon a novel 3D control signal. This signal represents video dynamics as a point cloud, introducing three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.
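The abstract names multi-frequency positional encoding as the way the control signal distinguishes fine-grained motion. The sketch below illustrates the general idea only: it uses a generic sinusoidal encoding, and the function name `multi_frequency_encoding`, the band count, and the `2**k * pi` frequency schedule are all illustrative assumptions, not the paper's formulation. It shows that a small displacement of a point produces a larger change in the higher-frequency bands.

```python
import math


def multi_frequency_encoding(point, num_bands=4):
    """Encode a 3D point with sin/cos features at geometrically spaced frequencies.

    Hypothetical helper: the band count and the 2**k * pi schedule are
    illustrative assumptions, not values taken from the paper.
    """
    features = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        for coord in point:
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


# Two points that differ by a small displacement along x.
p0 = (0.50, 0.25, 0.75)
p1 = (0.51, 0.25, 0.75)

enc0 = multi_frequency_encoding(p0)
enc1 = multi_frequency_encoding(p1)

# Largest feature change within each frequency band.
per_band = 6  # 3 coordinates x (sin, cos)
band_diffs = [
    max(
        abs(a - b)
        for a, b in zip(
            enc0[k * per_band:(k + 1) * per_band],
            enc1[k * per_band:(k + 1) * per_band],
        )
    )
    for k in range(4)
]
print(band_diffs)
```

In this toy setting the low bands barely register the 0.01 shift while the high bands change noticeably, which is the usual motivation for combining several frequencies in one encoding.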

Paper Structure (29 sections, 4 equations, 9 figures, 4 tables, 1 algorithm)

This paper contains 29 sections, 4 equations, 9 figures, 4 tables, 1 algorithm.

Introduction
Related Work
Controllable Video Generation
Appearance Editing
Camera Control
Spatial Object Editing
Method
Appearance Control Signal Representation
Motion Control Signal Representation
Architecture and Training Integration
Controllable Video Generation and Editing
Appearance Editing.
Camera Control.
Spatial Object Editing.
Experiment
...and 14 more sections

Figures (9)

Figure 1: FlexAM treats controllable video generation as a fundamental disentanglement of appearance and motion. It defines a novel 3D control signal, based on a dynamic point cloud, that explicitly represents motion in a flexible, precise, and depth-aware way. This allows FlexAM, as a single unified model, to handle a wide range of tasks, including I2V/V2V editing, camera control, and spatial object editing.
Figure 2: The FlexAM pipeline. Our approach disentangles video generation into appearance and motion control. The input video is first processed to create a 3D point cloud, which is then rendered into a multi-attribute motion video that serves as the motion control signal. This motion control signal, along with a masked input video (for appearance control), is fed into the FlexAM generative model. FlexAM processes these control signals (via VAE encoders, an Adapter, and a tokenizer) alongside video, motion, input, and mask tokens. The model then generates a new video by integrating these decoupled appearance and motion controls, as illustrated by the example of transforming a polar bear video into a wolf video while maintaining motion dynamics.
Figure 3: Qualitative comparison on motion transfer between our method, DaS, Wan2.2 Fun, and VACE. We compare the results of different methods in transferring the human motion from the source video to a new appearance. Compared to the baselines, our method transfers the motion more accurately.
Figure 4: Qualitative comparison on foreground and background editing. We transfer the motion from the source videos while replacing the foreground/background appearance using the reference prompt/image. (a) Replace bear with Godzilla: compared to VACE, our method better follows the reference poses and preserves identity and color details. (b) Airplane wing over mountains at sunset: VACE maintains foreground consistency but loses background motion, whereas our method integrates the input video’s background motion with the new appearance, preserving coherent dynamics.
Figure 5: Qualitative comparison on camera control. We re-render the source video with the pan up-right target camera trajectory. ReCamMaster shows artifacts and deviates from the path; DaS fails to track the target pose. Our method closely matches the target trajectory while preserving appearance and temporal stability.
...and 4 more figures
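The Figure 2 caption describes rendering a 3D point cloud into a motion video that carries per-point attributes. As a rough illustration of what rendering one such frame could involve, the sketch below projects points through a pinhole camera and keeps the nearest point per pixel with a z-buffer. The function `render_points`, the camera model, and the string attributes are hypothetical choices made for this example; the paper's actual renderer and attribute set are not specified in this summary.

```python
import math


def render_points(points, attrs, width, height, focal):
    """Rasterise 3D points (camera coordinates, z forward) into an attribute image.

    Hypothetical helper: assumes a pinhole camera with the principal point at
    the image centre. Each pixel keeps the attribute of the nearest point
    (simple z-buffer); pixels that no point lands on stay None.
    """
    cx, cy = width / 2.0, height / 2.0
    image = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), attr in zip(points, attrs):
        if z <= 0:
            continue  # point is behind the camera
        u = math.floor(focal * x / z + cx)
        v = math.floor(focal * y / z + cy)
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = attr
    return image, depth


# Toy frame: two points on the same ray, one off to the right, one behind.
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
attrs = ["far", "near", "right", "behind"]
image, depth = render_points(points, attrs, width=4, height=4, focal=2.0)
```

Repeating this per frame as the points move would give a video whose pixels encode where each tracked point is, and the stored depth is what would make such a signal depth-aware.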
