ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Cheng Luo, Bizhu Wu, Bing Li, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen, Bernard Ghanem

Abstract

In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to reactive appropriateness, which conventional motion metrics focused on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.

Paper Structure

This paper contains 60 sections, 46 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Illustration of the proposed new task: Reactive Listener Motion Generation from Speaker Utterance. Given a speaker’s utterance, i.e., transcript and/or audio (optionally supplemented with emotion), a generative model such as our ReactMotion generates a corresponding responsive body-motion sequence for the listener.
  • Figure 2: ReactMotionNet dataset construction. We curate dyadic listener motions (Step 1), synthesize speaker conditions via inverse inference and Text-to-Speech (TTS) (Step 2), filter unreliable samples (Step 3), and rank/re-tier speaker–listener pairs into gold/silver/negative preferences (Step 4).
  • Figure 3: Overview of the ReactMotion framework. We use modality-specific tokenizers to convert raw data, i.e., the speaker’s utterances (including transcript, audio, and emotion) and the listener’s reactive motions, into discrete special tokens. A Seq2Seq model then integrates the tokenized information across modalities and learns to generate the listener’s reactive motions from the speaker’s utterances.
  • Figure 4: Qualitative results. We compare gold and silver listener reactions, motions generated by our ReactMotion (Ours), a cross-entropy trained variant (CE), and a cascaded LLM→T2M baseline, all conditioned on the same speaker utterance. We visualize the resulting 3D motion sequences.
  • Figure 5: User study on reactive appropriateness.
  • ...and 3 more figures
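
The gold/silver/negative tiers named in the Figure 2 caption suggest how a single utterance can supervise several candidate motions at once. The sketch below is only an illustration of that idea, not the authors' data format or training code: the `UtteranceSample` schema, the `preference_pairs` helper, the numeric tier ranks, and the example transcript and motion ids are all hypothetical. It enumerates (preferred, rejected) motion pairs in which the preferred motion always comes from a higher tier.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Tuple

# Tier names come from the Figure 2 caption; the numeric ranks are an assumption.
TIER_RANK = {"gold": 2, "silver": 1, "negative": 0}


@dataclass
class UtteranceSample:
    """One speaker utterance with tiered candidate listener motions (hypothetical schema)."""
    transcript: str
    # Maps a tier name to the ids of candidate listener motions in that tier.
    candidates: Dict[str, List[str]] = field(default_factory=dict)


def preference_pairs(sample: UtteranceSample) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) motion-id pairs, one per cross-tier combination."""
    pairs: List[Tuple[str, str]] = []
    # Visit tiers from most to least appropriate.
    tiers = sorted(sample.candidates, key=TIER_RANK.__getitem__, reverse=True)
    for i, better in enumerate(tiers):
        for worse in tiers[i + 1:]:
            pairs.extend(product(sample.candidates[better], sample.candidates[worse]))
    return pairs


# Made-up example data, for illustration only.
sample = UtteranceSample(
    transcript="I finally got the job!",
    candidates={
        "gold": ["m_clap"],
        "silver": ["m_nod"],
        "negative": ["m_shrug", "m_turn_away"],
    },
)
pairs = preference_pairs(sample)
```

Pairs of this kind are the usual input to pairwise preference losses; whether ReactMotion consumes its tiers pairwise or in some other form is not stated in this excerpt.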