MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context

Junyu Shi, Yong Sun, Zhiyuan Zhang, Lijiang Liu, Zhengjie Zhang, Yuxin He, Qiang Nie

TL;DR

MoGIC introduces a unified multimodal framework that explicitly models human intention and leverages visual priors to boost motion generation conditioned on language, vision, and partial motions. It combines modality-specific encoders, a Conditional Masked Transformer with semantic modulation and adaptive mixture of attention, and disentangled heads for intention prediction and motion generation, trained jointly on five cross-modal tasks. A large-scale Mo440H benchmark (440 hours from 21 datasets) underpins tri-modal training and evaluation, enabling vision-conditioned generation, in-between tasks, and intention understanding. Empirical results show substantial improvements in motion fidelity (FID reductions on HumanML3D and Mo440H), effective captioning with lightweight language heads, and new capabilities such as image-to-motion synthesis and vision-guided completion, highlighting the potential of intention-aware, multimodal motion synthesis for controllable embodied AI.

Abstract

Existing text-driven motion generation methods often treat synthesis as a bidirectional mapping between language and motion, but remain limited in capturing the causal logic of action execution and the human intentions that drive behavior. The absence of visual grounding further restricts precision and personalization, as language alone cannot specify fine-grained spatiotemporal details. We propose MoGIC, a unified framework that integrates intention modeling and visual priors into multimodal motion synthesis. By jointly optimizing multimodal-conditioned motion generation and intention prediction, MoGIC uncovers latent human goals, leverages visual priors to enhance generation, and exhibits versatile multimodal generative capability. We further introduce a mixture-of-attention mechanism with adaptive scope to enable effective local alignment between conditional tokens and motion subsequences. To support this paradigm, we curate Mo440H, a 440-hour benchmark from 21 high-quality motion datasets. Experiments show that after finetuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H, surpasses LLM-based methods in motion captioning with a lightweight text head, and further enables intention prediction and vision-conditioned generation, advancing controllable motion synthesis and intention understanding. The code is available at https://github.com/JunyuShi02/MoGIC

Paper Structure

This paper contains 40 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of MoGIC. The framework consists of modality-specific encoders, a Conditional Masked Transformer (CMT), a Motion Generation Head (MGH), and an Intention Prediction Head (IPH). Language, vision, and motion inputs are first processed by their respective encoders to produce latent tokens. Motion tokens are randomly masked and passed through the CMT, where semantic-level and fine-grained conditions modulate the motion token in series. The resulting conditional tokens $z$ are used in two branches: (i) the masked motion tokens are reconstructed via the MGH, which denoises them into clean motion latent tokens and decodes them into motion sequences; (ii) $z$ serves as key and query signals for the IPH to predict the underlying intention.
  • Figure 2: Comparisons of intention prediction results.
  • Figure 3: Visualization of motion generation and motion in-between tasks with vision modality.
  • Figure 4: The effectiveness of mixture-of-attention.
  • Figure A1: Visualization of data distributions. For each dataset, we randomly sample 2,000 motion sequences. Each sequence is temporally averaged to obtain a compact feature representation, which is then reduced in dimensionality using t-SNE.
  • ...and 3 more figures
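
The Figure 1 caption mentions that motion tokens are randomly masked before entering the CMT and later reconstructed by the MGH. As a rough illustration of that masking step only, the sketch below masks a fraction of toy latent tokens and records which positions would need to be reconstructed. The function name `mask_motion_tokens`, the `MASK` placeholder, the fixed `mask_ratio`, and the toy token values are all assumptions made for illustration; this section does not specify the actual masking schedule or mask-token design used by MoGIC.

```python
import random

# Hypothetical placeholder standing in for a learned mask token (assumption).
MASK = None


def mask_motion_tokens(tokens, mask_ratio, rng):
    """Randomly replace a fraction of motion tokens with MASK.

    Returns the masked sequence and the sorted indices that were masked,
    i.e. the positions a generation head would be asked to reconstruct.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    n_mask = round(len(tokens) * mask_ratio)
    # Sample distinct positions without replacement.
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)  # shallow copy so the input stays untouched
    for i in masked_idx:
        masked[i] = MASK
    return masked, masked_idx


rng = random.Random(0)  # seeded for repeatability
# Eight toy latent tokens of dimension 2 (illustrative values only).
tokens = [[float(t), float(t) + 0.5] for t in range(8)]
masked, idx = mask_motion_tokens(tokens, mask_ratio=0.5, rng=rng)
```

Here `idx` lists the positions a reconstruction head would fill in, while the unmasked entries of `masked` play the role of the partial-motion context. A fixed ratio is used purely for simplicity.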