ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Fengyuan Yang, Luying Huang, Jiazhi Guan, Quanwei Yang, Dongwei Pan, Jianglin Fu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Angela Yao

Abstract

Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained, independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility, and their heavy 3D pre-processing further limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism that maintains subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project page is available at: https://martayang.github.io/ONE-SHOT/.
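To make the abstract's mechanism more concrete, the following is a minimal PyTorch sketch of one way a decoupled motion cross-attention layer with re-grounded rotary positions could be realized. It is an illustrative assumption rather than the authors' implementation: the class name `DecoupledMotionCrossAttention`, the helper `rope_1d`, the 1-D positional layout, and the affine `scale`/`shift` grounding are all hypothetical, since the paper only states that canonical-space pose tokens are injected via cross-attention and that Dynamic-Grounded-RoPE aligns the motion and environment spaces without heuristic 3D alignment.

```python
# Minimal sketch (not the authors' code) of a decoupled motion cross-attention
# block. Video tokens are queries; canonical-space motion tokens are keys/values.
# Motion positions are re-grounded into the video token frame before RoPE, as a
# stand-in for the paper's Dynamic-Grounded-RoPE.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope_1d(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Apply 1-D rotary position embedding. x: (B, N, D) with D even; pos: (B, N) float."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, device=x.device) / d))
    angles = pos[..., None] * freqs                    # (B, N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)


class DecoupledMotionCrossAttention(nn.Module):
    """Cross-attention from video tokens to canonical-space motion tokens.

    The motion tokens live in their own canonical coordinate frame; a per-sample
    affine map (scale, shift) grounds their positions into the video frame
    before rotary encoding -- a hypothetical simplification of Dynamic-Grounded-RoPE."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, video_tok, video_pos, motion_tok, motion_pos, scale, shift):
        # video_tok: (B, Nv, D), motion_tok: (B, Nm, D)
        # video_pos: (B, Nv), motion_pos: (B, Nm); scale/shift: (B, 1) grounding params
        B, Nv, D = video_tok.shape
        q = rope_1d(self.q(video_tok), video_pos)
        grounded_pos = motion_pos * scale + shift      # canonical -> video frame
        k, v = self.kv(motion_tok).chunk(2, dim=-1)
        k = rope_1d(k, grounded_pos)

        def heads(t):                                  # (B, N, D) -> (B, H, N, D/H)
            return t.view(B, -1, self.num_heads, D // self.num_heads).transpose(1, 2)

        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        return video_tok + self.out(attn.transpose(1, 2).reshape(B, Nv, D))
```

The design point mirrored here is that the motion keys keep their canonical-space content while their positions are mapped into the video token frame before rotary encoding, so the attention map can establish spatial correspondence across the two domains; how the real grounding parameters are obtained is not specified in this section.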

Paper Structure

This paper contains 20 sections, 13 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Human-Environment Synthesis by ONE-SHOT. Three kinds of conditions are decoupled and factorized within our ONE-SHOT framework to enable compositional generation: (blue) human dynamics, (green) textual prompt, and (orange) environment space. The generated videos exhibit high consistency of human subjects and environments w.r.t. the given conditions, and accurately convey the text-specified interaction of "holding a longsword".
  • Figure 2: Model Architecture. Our model is built upon a Pretrained VFM (gray, right) augmented with an additional Conditioning Branch (cyan, left) for concept injection. The environmental condition $\mathbf{c}_{\text{env}}$ is encoded from 2D-projected point clouds and depth maps, while identity appearance $\mathbf{c}_{\text{id}}$ and context memory $\mathbf{c}_{\text{mem}}$ jointly maintain visual coherence (a toy sketch of one possible context-memory rollout follows this figure list). Human dynamics $\mathbf{c}_{\text{mot}}$ are disentangled from environmental inputs and injected through the proposed Decoupled Motion Cross-Attention.
  • Figure 3: Decoupled Motion Cross-Attention. The canonical-space human pose is injected into the video through cross-attention, where the proposed Dynamic-Grounded-RoPE bridges the spatial discrepancy between the environment and human spaces.
  • Figure 4: Self-Reconstruction and Cross-Composition Comparisons on Traj100. (a) Our method better preserves both the scene structure and human appearance, and (b) maintains stable subject placement and motion consistency when swapping identity.
  • Figure 5: Text-guided editing with compositional controls. Beyond composing scene, identity, and motion, our method enables instruction-based editing via text prompts, indicating strong compatibility with the pretrained VFM and minimal loss of its native text-conditioned editing ability.
  • ...and 2 more figures
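The context memory $\mathbf{c}_{\text{mem}}$ mentioned in the Figure 2 caption suggests that long-horizon generation is conditioned on previously generated content. Below is a minimal, purely illustrative sketch of a chunked rollout that combines a fixed anchor latent with a rolling window of recent chunks; `generate_chunk`, the tensor layout, and the window size are assumptions, as the paper's Hybrid Context Integration is not specified at this level of detail in this section.

```python
# Illustrative sketch of long-horizon synthesis with a hybrid context:
# a fixed long-term anchor (identity/scene reference) plus a rolling
# short-term window of recently generated latent chunks.
from collections import deque
import torch


def synthesize_long_video(generate_chunk, anchor_latent, num_chunks, window=2):
    """anchor_latent: (C, T0, H, W) reference latent kept for the whole rollout.
    generate_chunk(context) -> (C, T, H, W): one denoised chunk conditioned on
    the concatenated context latents (hypothetical interface)."""
    recent = deque(maxlen=window)          # rolling short-term memory
    chunks = []
    for _ in range(num_chunks):
        # hybrid context = long-term anchor + short-term rolling window
        context = torch.cat([anchor_latent, *recent], dim=1) if recent else anchor_latent
        chunk = generate_chunk(context)
        recent.append(chunk)               # newest chunk displaces the oldest
        chunks.append(chunk)
    return torch.cat(chunks, dim=1)        # full long-horizon latent sequence
```

Keeping a fixed anchor alongside the rolling window is one simple way to trade off long-term subject and scene consistency against short-term motion continuity; the actual mechanism used by the paper may differ.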