
MIBURI: Towards Expressive Interactive Gesture Synthesis

M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt

TL;DR

MIBURI is the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. It also introduces auxiliary objectives that encourage expressive and diverse gestures while preventing convergence to static poses.

Abstract

Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations against recent baselines demonstrate that our causal and real-time approach produces natural and contextually aligned gestures. We urge the reader to explore demo videos at https://vcai.mpi-inf.mpg.de/projects/MIBURI/.
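
The abstract's "multi-level discrete tokens" can be illustrated with a small residual quantization sketch. This is an assumption on our part: the Figure 3 caption labels the codec section `subsec:rvq`, which suggests residual vector quantization, but the codec design is not spelled out here. The codebooks, dimensions, and function names below are hypothetical toy values, not the authors' implementation.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next level sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected entry from every level to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse level followed by a finer level.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # level 1: coarse pose
    [[0.0, 0.0], [0.5, 0.0]],  # level 2: fine correction
]
pose = [1.4, 1.0]  # stand-in for one body part's pose feature
tokens = rvq_encode(pose, codebooks)
recon = rvq_decode(tokens, codebooks)
print(tokens, recon)
```

In this toy setup the first token captures the coarse pose and the second refines it, which is one way a codec could expose a coarse-to-fine token hierarchy per body part.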

Paper Structure (41 sections, 5 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Miburi: An online, causal framework for real-time dialogue and gesture generation. Given live speech, the system produces full-duplex responses with synchronized full-body gestures. Right: Interactive demo using our approach.
  • Figure 2: Overview. Existing solutions [Nagy2021gesturebot, chen2025taoavatar] to animate ECAs involve a complex pipeline (above) of multiple components to generate gestures with speech. Miburi (below) generates full-body co-speech gestures directly by utilizing the internal semantic/acoustic tokens of a speech-text foundation model [defossez2024moshi].
  • Figure 3: Miburi Architecture. Given Moshi's speech/text tokens (\ref{subsec:moshi}), our approach generates a sequence of gesture tokens, which are obtained through Body-part aware Gesture Codecs (\ref{subsec:rvq}). This online framework takes Moshi's text/speech tokens as input and predicts gesture tokens through autoregressive temporal and kinematic transformers (\ref{subsec:tdm}).
  • Figure 4: User Study for Perceptual Evaluation. Here, the red line indicates chance level (50%) and *: ($p < 0.05$), ***: ($p < 0.001$).
  • Figure 5: System architecture of our real-time demo. The main inference process runs Moshi and Miburi in a continuous loop, while two parallel processes handle speech/text visualization and motion rendering. Data is streamed between processes at each timestep via websockets to support low-latency, full-duplex interaction. Right: the user-facing interface of the demo.
  • ...and 1 more figure
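
The Figure 3 caption describes autoregressive temporal and kinematic transformers, and the abstract calls the framework two-dimensional and causal. A minimal sketch of that generation order follows, assuming an outer loop over time and an inner loop over body parts. The body-part list, the `Predictor` signature, and `toy_predict` are hypothetical stand-ins for the learned models, which are not specified here.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical body-part order; the source does not list the actual partition.
PARTS = ["upper_body", "hands", "lower_body", "face"]

# (conditioning token, past frames, tokens already generated this frame, part) -> token
Predictor = Callable[[int, List[List[int]], List[int], str], int]


def generate_gestures(cond_stream: Iterable[int], predict: Predictor) -> Iterator[List[int]]:
    """Yield one frame of gesture tokens per incoming conditioning token."""
    history: List[List[int]] = []
    for cond in cond_stream:  # temporal axis: only past and current input is visible
        frame: List[int] = []
        for part in PARTS:  # kinematic axis: later parts see earlier parts
            frame.append(predict(cond, history, frame, part))
        history.append(frame)
        yield frame


def toy_predict(cond: int, history: List[List[int]], frame: List[int], part: str) -> int:
    """Stand-in for a learned model: a deterministic function of its causal context."""
    return (cond + 3 * len(history) + sum(frame) + len(part)) % 16


frames = list(generate_gestures([5, 2, 7], toy_predict))
print(frames)
```

Because each frame is emitted as soon as its conditioning token arrives, changing a later input cannot alter frames that were already produced, which is the property an online system relies on.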