Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow
Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera
TL;DR
DualFlow introduces a unified rectified-flow framework for efficient, multi-modal two-person motion generation that supports interactive and reactive tasks conditioned on text, music, and retrieved exemplars. By combining Retrieval-Augmented Generation (RAG) grounding, cross-modal conditioning, and a contrastive rectified-flow objective with Look-Ahead and causal attention, the method generates temporally coherent, semantically aligned duet motions with fast inference. The approach demonstrates state-of-the-art performance across text-to-motion, music-to-motion, and multi-modal duet benchmarks, with strong qualitative results and ablations validating the contribution of each component. This work enables responsive, context-aware digital humans and immersive choreography in interactive settings.
Abstract
Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating the error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use a contrastive objective that further strengthens alignment with the conditioning signals and introduce a synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.
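The straight-line sampling idea behind rectified flow can be illustrated with a minimal NumPy sketch. It is not the DualFlow model: the tensor shape, the function names, and the closed-form "oracle" velocity field (which stands in for a trained, conditioned network) are all assumptions made for illustration. The sketch shows the standard rectified-flow construction, where a noise sample and a data sample are joined by a linear path whose velocity is constant, so a few Euler steps (here even one) can traverse it.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Regression target for the velocity field: constant along the path."""
    return x1 - x0


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
# Hypothetical layout: (persons, frames, pose features).
x0 = rng.standard_normal((2, 4, 3))  # noise sample
x1 = rng.standard_normal((2, 4, 3))  # stand-in for a two-person motion clip


# Assumption: an oracle field that always points at the single target x1.
# A real model would replace this with a learned network v(x, t, condition).
def oracle_velocity(x, t):
    return (x1 - x) / (1.0 - t)


one_step = euler_sample(oracle_velocity, x0, num_steps=1)
ten_step = euler_sample(oracle_velocity, x0, num_steps=10)
```

Because the path is straight in this toy setting, the one-step and ten-step results coincide with the target up to floating-point error. With a learned velocity field the paths are only approximately straight, so the number of steps becomes a trade-off between speed and accuracy.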
