UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu

Abstract

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
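As a rough illustration of the distillation idea behind DPA, the sketch below computes the closed-form KL divergence between two diagonal Gaussian posteriors. The abstract does not state the posterior family, the direction of the KL term, or any weighting, so the diagonal-Gaussian form, the function name `diag_gaussian_kl`, and the argument order are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def diag_gaussian_kl(
    mu_p: Sequence[float],
    logvar_p: Sequence[float],
    mu_q: Sequence[float],
    logvar_q: Sequence[float],
) -> float:
    """Closed-form KL(p || q) for two diagonal Gaussians.

    Assumption: p plays the role of the vision-fused posterior (teacher)
    and q the motion-only posterior (student). The source does not
    confirm this direction or the Gaussian parameterization.
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        var_ratio = math.exp(lp - lq)                # sigma_p^2 / sigma_q^2
        mean_term = (mp - mq) ** 2 / math.exp(lq)    # squared mean gap, scaled by sigma_q^2
        total += 0.5 * (lq - lp + var_ratio + mean_term - 1.0)
    return total


# Hypothetical 2-D posteriors, for illustration only.
teacher_mu, teacher_logvar = [0.5, -0.2], [0.0, 0.1]
student_mu, student_logvar = [0.4, 0.0], [0.0, 0.0]
dpa_term = diag_gaussian_kl(teacher_mu, teacher_logvar, student_mu, student_logvar)
```

Minimizing a term of this kind with respect to the student parameters pulls the motion-only posterior toward the vision-fused one, which is consistent with the stated goal of not needing images at inference.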

Paper Structure (58 sections, 19 equations, 16 figures, 15 tables)

Figures (16)

  • Figure 1: Left: Overview and performance comparison of UniMotion, a unified framework for any-to-any Motion, Text, and Vision understanding, generation, and editing. UniMotion is the first model to support all seven tri-modal tasks and achieves consistent superiority over existing methods. Right: Representative task demonstrations.
  • Figure 2: Overview of UniMotion. (Left) UniMotion unifies motion, text, and RGB through symmetric continuous pathways: motion and images are encoded into continuous latents (via CMA-VAE and a vision VAE), mapped by a dual-path embedder that separates semantic abstraction from detail-preserving generation, and processed by a shared backbone for both multimodal understanding and modality-specific flow-based synthesis. (Right) Latent Reconstruction Alignment (LRA) pre-trains the motion pathway with a self-supervised Motion-to-Motion task, using motion latents as dense, unambiguous conditions to reconstruct motion from noise, thereby co-calibrating the embedder, backbone, and motion head before all downstream tri-modal learning.
  • Figure 3: CMA-VAE with DPA. CMA-VAE learns a continuous motion latent space using a motion-only encoder for inference and a vision-fused encoder for training-time visual supervision. When paired images are available, motion-guided visual features are fused with motion and distilled via DPA, enabling the shared decoder to learn visually informed motion latents without requiring images at inference.
  • Figure 4: Qualitative comparison on T2M and M2T.
  • Figure 5: Qualitative comparison of text-driven motion generation on HumanML3D [humanml3d]. Each column corresponds to one text prompt, and rows show outputs from MotionGPT [motiongpt], MoMask [momask], and UniMotion. Red text highlights the key prompt constraints, while red dashed boxes mark prompt-motion mismatches in the baseline outputs, including missing body-part constraints, incorrect motion trajectories, and weak temporal modifiers. UniMotion produces motions with closer prompt correspondence and more coherent temporal transitions.
  • ...and 11 more figures
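
The Figure 2 caption describes LRA as reconstructing motion from noise, conditioned on motion latents, through a flow-based head. One common way to build a training pair for such a head is linear-interpolation flow matching, sketched below. The source does not specify the flow formulation, so the interpolation path, the velocity target, and the names `flow_matching_pair` and `mse` are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def flow_matching_pair(
    x1: Sequence[float], t: float, rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Build one (noisy input, target velocity) pair for a clean latent x1.

    Assumption: a straight-line path x_t = (1 - t) * x0 + t * x1 from
    Gaussian noise x0 to the clean latent x1, with constant velocity
    x1 - x0. The paper may use a different path or schedule.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # noise sample
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
    velocity = [b - a for a, b in zip(x0, x1)]             # regression target
    return xt, velocity


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


# Hypothetical 4-D motion latent, for illustration only.
rng = random.Random(0)
latent = [0.3, -1.2, 0.8, 0.0]
t = 0.25
xt, v = flow_matching_pair(latent, t, rng)

# A perfect head would output v given (xt, t, condition); here the
# condition would be the same motion latent, as in a Motion-to-Motion task.
loss_if_perfect = mse(v, v)
```

In this setup, stepping from `xt` along `v` for the remaining time `1 - t` recovers the clean latent, which is what makes the velocity a well-defined regression target.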