MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo

TL;DR

MIMO introduces a spatially decomposed, 3D-aware diffusion framework for controllable character video synthesis from simple inputs. By lifting 2D frames into 3D and separating them into human, scene, and occlusion components, and further disentangling identity and motion in a structured latent space, it enables arbitrary-character control, novel 3D motions, and real-world scene interactions. The approach achieves strong quantitative and qualitative performance, outperforming prior 2D diffusion-based methods and showcasing robustness to occlusions and scene complexity. This 3D-aware decomposition strategy offers a promising direction for scalable, interactive video synthesis in real-world contexts.

Abstract

Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. It is a fundamental problem in the computer vision and graphics communities. 3D methods typically require multi-view captures for per-case training, which severely limits their ability to model arbitrary characters in a short time. Recent 2D methods overcome this limitation with pre-trained diffusion models, but they struggle with pose generality and scene interaction. To this end, we propose MIMO, a novel framework that not only synthesizes character videos with controllable attributes (i.e., character, motion, and scene) provided by simple user inputs, but also simultaneously achieves advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video into compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators and decompose the video clip into three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded into a canonical identity code, a structured motion code, and a full scene code, which are used as control signals for the synthesis process. This spatially decomposed modeling enables flexible user control, complex motion expression, and 3D-aware synthesis for scene interactions. Experimental results demonstrate the effectiveness and robustness of the proposed method.
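
The lifting and depth-based layering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `margin` tolerance, the use of the median human depth as the layer boundary, and the convention that larger depth means farther from the camera are all assumptions made for illustration. The depth map and human mask are assumed to come from an off-the-shelf monocular depth estimator and person segmenter, and plain Python lists are used only to keep the example dependency-free.

```python
from statistics import median


def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth z into camera space."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def decompose_by_depth(depth, human_mask, margin=0.1):
    """Label each pixel as 'human', 'occlusion', or 'scene'.

    depth      : rows of floats; larger means farther from the camera (assumed).
    human_mask : rows of bools from any person segmenter (assumed to be given).
    margin     : hypothetical tolerance, in the same units as depth.
    """
    human_depths = [d for d_row, m_row in zip(depth, human_mask)
                    for d, m in zip(d_row, m_row) if m]
    if not human_depths:
        raise ValueError("human_mask selects no pixels")
    ref = median(human_depths)  # representative depth of the human layer

    def label(d, m):
        if m:
            return "human"
        # Non-human pixels clearly in front of the human form the occlusion layer;
        # everything else is treated as the underlying scene.
        return "occlusion" if d < ref - margin else "scene"

    return [[label(d, m) for d, m in zip(d_row, m_row)]
            for d_row, m_row in zip(depth, human_mask)]


# Toy 3x3 frame: one near occluder (depth 1), a human (depth 2), far background (depth 5).
depth = [[1.0, 5.0, 5.0],
         [5.0, 2.0, 5.0],
         [5.0, 2.0, 5.0]]
human_mask = [[False, False, False],
              [False, True, False],
              [False, True, False]]
labels = decompose_by_depth(depth, human_mask)
```

In this toy frame the top-left pixel lies in front of the human and is labeled as occlusion, while the remaining non-human pixels fall into the scene layer. A real pipeline would operate per frame on dense arrays and would need a more robust boundary than a single median depth.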

Paper Structure

This paper contains 13 sections, 2 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Given a single reference image of a character, MIMO can synthesize animated avatars driven by 3D poses (visualized as skeleton sequences) retrieved from motion datasets (left) or extracted from in-the-wild videos (right). Real-world scenes from driving videos can also be integrated into the synthesis with natural human-object interactions. MIMO simultaneously achieves advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework.
  • Figure 2: The basic idea of MIMO: controllable character video synthesis with desired attributes provided by multiple inputs (e.g., a single image for the character, a pose sequence for the motion, and a single video or even an image for the scene) or by a driving video. Target attributes are embedded into the latent space as target codes, and the driving video is spatially decomposed into spatial codes. Target character videos can be generated under user control from the combined attribute codes.
  • Figure 3: An overview of the proposed framework. The video clip is decomposed into three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on 3D depth. The human component is further disentangled into identity and motion properties via canonical appearance transfer and structured body codes, and encoded into an identity code $\mathcal{C}_{id}$ and a motion code $\mathcal{C}_{mo}$. The scene and occlusion components are embedded with a shared VAE encoder and re-organized as a full scene code $\mathcal{C}_{so}$. These latent codes are inserted into a diffusion-based decoder as conditions for video reconstruction.
  • Figure 4: The architecture of the diffusion-based decoder.
  • Figure 5: Results of animating diverse characters (e.g., realistic humans, cartoon characters and personified ones) with novel 3D motions retrieved from the motion database (a) or extracted from the driving video (b), and interactive scenes from in-the-wild videos (c).
  • ...and 3 more figures
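
The attribute recombination described in the captions of Figures 2 and 3 can be pictured with a small sketch. It is a hypothetical illustration of the control interface only, not the paper's code: the `AttributeCodes` container and the `recombine` helper are invented names, and the string values stand in for the latent codes $\mathcal{C}_{id}$, $\mathcal{C}_{mo}$, and $\mathcal{C}_{so}$ that the actual encoders would produce.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class AttributeCodes:
    """Hypothetical container for the three control codes."""
    identity: Any  # stands in for C_id (character appearance)
    motion: Any    # stands in for C_mo (pose sequence)
    scene: Any     # stands in for C_so (scene plus occlusion)


def recombine(driving: AttributeCodes, **targets: Any) -> AttributeCodes:
    """Return a copy of the driving codes with user-supplied targets swapped in.

    Attributes passed as None are ignored, so they keep the value decomposed
    from the driving video. Unknown attribute names raise TypeError.
    """
    overrides = {name: code for name, code in targets.items() if code is not None}
    return replace(driving, **overrides)


# Codes decomposed from a driving video (placeholder values).
driving = AttributeCodes(identity="id_from_video",
                         motion="motion_from_video",
                         scene="scene_from_video")

# Swap in a new character while keeping the driving motion and scene.
edited = recombine(driving, identity="id_from_reference_image")
```

Here `edited` keeps the motion and scene of the driving video and replaces only the character, which mirrors the single-reference-image use case in Figure 1. In the actual framework the combined codes would be passed to the diffusion-based decoder as conditions.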