U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation

Xiang Deng, Feng Gao, Yong Zhang, Youxin Pang, Xu Xiaoming, Zhuoliang Kang, Xiaoming Wei, Yebin Liu

TL;DR

U-Mind is presented as the first unified system for high-intelligence multimodal dialogue. It supports real-time generation, jointly models language, speech, motion, and video synthesis within a single interactive loop, and implements a Unified Alignment and Reasoning Framework that addresses two key challenges: cross-modal synchronization and the preservation of reasoning abilities.

Abstract

Full-stack multimodal interaction in real time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.

Paper Structure (15 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Given a user query in text or speech, our system performs internal Chain-of-Thought (CoT) planning and produces synchronized responses across text, speech, and gesture. As shown, the model can handle both open-domain dialogue and various instruction-following tasks, generating coherent language, natural prosody, and expressive body motion. The final output is rendered into photorealistic talking videos, showcasing our framework’s capability for high-level multimodal understanding and generation.
  • Figure 2: Method Overview. Our framework adopts a two-stage training paradigm. In Stage 1, we conduct rehearsal-driven pretraining to preserve symbolic reasoning (Textual QA), maintain speech alignment (Text2Speech), and learn new modalities (Text2Motion, Speech2Motion). All tasks are unified via discrete tokens processed by a shared U-Mind backbone. In Stage 2, we instruction-tune the model with multimodal prompts (text or audio), generating CoT plans followed by coherent outputs across modalities.
  • Figure 3: Multimodal Dialogue Results. U-Mind performs CoT-based reasoning and generates synchronized speech and motion, producing photorealistic, context-aware responses. In contrast, SOLAMI degenerates into generic gestures without understanding the prompt, while LLM+TTS+LOM lacks coherence and cross-modal grounding.
  • Figure 4: Instruction-following Results. U-Mind interprets the user's intent through CoT planning and generates expressive, context-aware motions with realistic video output. In contrast, SOLAMI produces a shallow, literal response without understanding or simulating the intended imaginary action, while LLM+TTS+LOM lacks embodiment and cross-modal synchronization.
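
The Stage 1 task mixture described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that rehearsal amounts to interleaving batches from the previously learned tasks (Textual QA, Text2Speech) with batches from the new ones (Text2Motion, Speech2Motion). The task names come from the caption; the uniform weights, the function name `sample_task_schedule`, and the per-step sampling scheme are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter

# Task names follow the Figure 2 caption. The weights are hypothetical
# placeholders, not values reported by the paper.
STAGE1_TASK_WEIGHTS = {
    "textual_qa": 0.25,     # rehearsal: preserve symbolic reasoning
    "text2speech": 0.25,    # rehearsal: maintain speech alignment
    "text2motion": 0.25,    # new modality
    "speech2motion": 0.25,  # new modality
}


def sample_task_schedule(num_steps, weights=STAGE1_TASK_WEIGHTS, seed=0):
    """Return one task name per training step, drawn in proportion to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=num_steps)


schedule = sample_task_schedule(1000)
counts = Counter(schedule)
for name in STAGE1_TASK_WEIGHTS:
    print(f"{name}: {counts[name]} steps")
```

In a real training loop, each sampled task name would select which dataset the next batch is drawn from, so that the shared backbone keeps seeing the older tasks while it learns the new modalities.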