Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow

Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera

TL;DR

DualFlow introduces a unified rectified-flow framework for efficient, multi-modal two-person motion generation that supports interactive and reactive tasks conditioned on text, music, and retrieved exemplars. It combines Retrieval-Augmented Generation (RAG) grounding, cross-modal conditioning, and a contrastive rectified-flow objective with causal attention and Look-Ahead to produce fast, temporally coherent, and semantically aligned duet motions. The authors report state-of-the-art performance across text-to-motion, music-to-motion, and multi-modal duet benchmarks, supported by qualitative results and ablations that validate each component's contribution. The work targets responsive, context-aware digital humans and immersive choreography in interactive settings.

Abstract

Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating the error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use a contrastive objective that further strengthens alignment with the conditioning signals and introduce a synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.
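
The straight-line sampling idea behind rectified flow can be illustrated with a toy sketch. This is not DualFlow's implementation: the function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`) are hypothetical, the vectors are plain Python lists rather than motion sequences, and the "model" is an oracle that already knows the target. It only shows the generic rectified-flow recipe: points on the line between noise and data, a constant velocity target, and Euler integration from t = 0 to t = 1.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted: Vector, x0: Vector, x1: Vector) -> float:
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: an oracle velocity field standing in for a trained network.
noise = [0.0, 2.0]
data = [1.0, -1.0]


def oracle(x: Vector, t: float) -> Vector:
    return target_velocity(noise, data)


one_step = euler_sample(oracle, noise, num_steps=1)
ten_steps = euler_sample(oracle, noise, num_steps=10)
```

Because the oracle velocity is constant along the path, one Euler step and ten Euler steps land on the same endpoint up to floating-point error. A learned velocity field is only approximately straight, so in practice the number of steps remains a quality and speed trade-off.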

Paper Structure

This paper contains 26 sections, 17 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Our DualFlow model unifies two tasks: (a) Interactive Motion Generation, which synthesizes synchronized two-person interactions, (b) Reactive Motion Generation, which generates responsive motions for Person B (red) conditioned on Person A’s (blue) movements. The generation process is conditioned jointly on text, music, and the retrieved motion samples.
  • Figure 2: (a) Our framework takes text (CLIP-L/14), music, and motion sequences from an actor (A) and reactor (B) as inputs. Motion samples are retrieved using music features and LLM-decomposed text cues (spatial relationship, body movement, rhythm). These modality-specific latents are processed by cascaded Multi-Modal DualFlow Blocks that model interactive dynamics. Outputs are either both actors’ motions (interactive) or only the reactor’s motion (reactive) via a masking mechanism. (b) A DualFlow Block: in the interactive setting, both branches operate symmetrically with Motion Cross Attention coordinating joint motion; in the reactive setting, the actor branch is masked and the reactor branch employs a Causal Cross Attention module with Look-Ahead $L$, replacing Motion Cross Attention, to condition on the actor’s motion.
  • Figure 3: Interactive Two-person Generation results conditioned on text modality for the InterHuman-AS dataset.
  • Figure 4: Reactive Motion Generation results conditioned on text modality for the DD100 dataset.
  • Figure 5: FID vs. Steps
  • ...and 2 more figures
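
The Figure 2 caption mentions a Causal Cross Attention module with Look-Ahead $L$ for the reactive setting. One plausible reading, sketched below, is an attention mask in which the reactor's frame i may attend to the actor's frames up to index i + L. This is an assumption made for illustration; the function name `look_ahead_causal_mask` is hypothetical and the exact masking used in the paper is not stated in this summary.

```python
from typing import List


def look_ahead_causal_mask(num_frames: int, look_ahead: int) -> List[List[bool]]:
    """Boolean mask where entry [i][j] is True if reactor frame i may attend
    to actor frame j, assuming visibility up to frame i + look_ahead."""
    if num_frames < 0 or look_ahead < 0:
        raise ValueError("num_frames and look_ahead must be non-negative")
    return [
        [j <= i + look_ahead for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Toy usage: 4 frames, reactor may peek 1 frame ahead of the actor.
mask = look_ahead_causal_mask(num_frames=4, look_ahead=1)
strict = look_ahead_causal_mask(num_frames=3, look_ahead=0)
```

With `look_ahead=0` the mask reduces to the usual lower-triangular causal mask. Larger values let the reactor see a short window of the actor's upcoming motion, trading strict causality for more context.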