TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation

Ruineng Li, Daitao Xing, Huiming Sun, Yuanzhou Ha, Jinglin Shen, Chiuman Ho

TL;DR

TokenMotion tackles the challenge of human-centric video generation with joint camera and human motion control. It introduces a DiT-based framework that represents camera trajectories and human poses as spatio-temporal tokens and uses a decouple-and-fuse strategy bridged by a dynamic mask, along with motion patchification to create fixed-length motion sequences. Two motion encoders process $z_{camera}$ and $z_{pose}$, which are fused into $z_{fused}$ via Softmax-weighted masking and CrossAttn with LoRA modulation, enabling fine-grained control in both text-to-video and image-to-video settings. Across real-world datasets and rigorous baselines, TokenMotion delivers superior spatiotemporal coherence and accurate motion alignment, highlighting its potential for controllable, high-fidelity video generation in creative production.
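The Softmax-weighted masking step can be illustrated with a small sketch. This is not the paper's implementation: the function name `fuse_tokens`, the `human_mask` input, the `temperature` parameter, and the choice of logits are all assumptions made for illustration, and the CrossAttn and LoRA stages are omitted entirely. It only shows how a per-token mask could decide how much of $z_{camera}$ versus $z_{pose}$ ends up in $z_{fused}$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_tokens(z_camera, z_pose, human_mask, temperature=1.0):
    """Blend two token sequences with per-token softmax weights.

    z_camera, z_pose: lists of N tokens, each a list of D floats.
    human_mask: list of N scores in [0, 1]; 1 means the token lies on a human.
    All names and the logit design are hypothetical, not taken from the paper.
    """
    fused = []
    for cam_tok, pose_tok, m in zip(z_camera, z_pose, human_mask):
        # Logits favour the pose branch on human tokens, the camera branch elsewhere.
        w_cam, w_pose = softmax([(1.0 - m) / temperature, m / temperature])
        fused.append([w_cam * c + w_pose * p for c, p in zip(cam_tok, pose_tok)])
    return fused


# Two toy tokens: the first is background (mask 0), the second is on a human (mask 1).
z_camera = [[1.0, 0.0], [1.0, 0.0]]
z_pose = [[0.0, 1.0], [0.0, 1.0]]
z_fused = fuse_tokens(z_camera, z_pose, human_mask=[0.0, 1.0], temperature=0.1)
```

With a low temperature the background token is dominated by the camera branch and the human token by the pose branch; a mask value of 0.5 gives an equal blend. In this toy setup the mask is fixed, whereas the source describes it as dynamic.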

Abstract

Human-centric motion control in video generation remains a critical challenge, particularly when jointly controlling camera movements and human poses in scenarios like the iconic Grammy Glambot moment. While recent video diffusion models have made significant progress, existing approaches struggle with limited motion representations and inadequate integration of camera and human motion controls. In this work, we present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion, human motion, and their joint interaction. We represent camera trajectories and human poses as spatio-temporal tokens to enable local control granularity. Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask that effectively handles the spatially-and-temporally varying nature of combined motion signals. Through extensive experiments, we demonstrate TokenMotion's effectiveness across both text-to-video and image-to-video paradigms, consistently outperforming current state-of-the-art methods in human-centric motion control tasks. Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications.
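To make the idea of spatio-temporal tokens concrete, the following is a minimal sketch of patchifying a control signal. It assumes the signal (a camera or pose map) has already been rasterised into a frames × height × width grid; the abstract does not specify this, and the function name `patchify` and the patch sizes are hypothetical.

```python
def patchify(volume, pt, ph, pw):
    """Split a T x H x W volume (nested lists) into flattened patch tokens.

    Each token holds pt * ph * pw values; tokens are ordered time-major,
    then by row, then by column. Patch sizes are illustrative assumptions.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("volume dimensions must be divisible by patch sizes")
    tokens = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                # Gather one pt x ph x pw block and flatten it into a token.
                tokens.append([
                    volume[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return tokens


# A toy 4-frame, 4x4 single-channel control map whose values encode position.
volume = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)]
          for t in range(4)]
tokens = patchify(volume, pt=2, ph=2, pw=2)
```

Because each token covers a small block in both space and time, a control signal can act on a local region of a few frames rather than on the whole clip, which is one way to read the "local control granularity" mentioned above.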

Paper Structure

This paper contains 27 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: TokenMotion is a transformer-based video generation framework that enables simultaneous control of camera trajectories and human kinematic patterns. The framework demonstrates versatility across both text-to-video and image-to-video generation paradigms, while supporting flexible control configurations. *Text prompts are abbreviated for conciseness.
  • Figure 2: Architectural Overview. TokenMotion presents a novel video generation framework that combines a transformer-based video diffusion model with content-aware motion guidance. The architecture employs dual motion encoders that extract spatio-temporal motion tokens. These motion features are then processed through a specialized decoupling and fusion module, which dynamically modulates the strength of motion guidance based on content characteristics, enabling fine-grained control over temporal consistency.
  • Figure 3: Visualization of joint-control video generation results from Direct-A-Video [DirectAVideo24Yang], MotionCtrl [MotionCtrl24Wang], MotionBooth [MotionBooth24Wu], and our TokenMotion-T. The cases above show that TokenMotion jointly handles human-motion and camera-motion control while remaining consistently aligned with the input prompts.
  • Figure 4: Visualization of camera control over diverse objects, scenarios, and complex camera motion. Our model shows superior visual fidelity and enhanced control flexibility compared to commercial tools (Runway). *Text prompts are omitted to save space.
  • Figure 5: Qualitative results for TokenMotion-I.
  • ...and 2 more figures