ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Cheng Luo; Bizhu Wu; Bing Li; Jianfeng Ren; Ruibin Bai; Rong Qu; Linlin Shen; Bernard Ghanem

Abstract

In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. Modeling such nonverbal listener behaviors remains underexplored and is challenging because human reactions are inherently non-deterministic. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this design, we develop preference-oriented evaluation protocols tailored to reactive appropriateness, which conventional motion metrics focused on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.
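As an illustration of the one-to-many supervision described above, the following minimal sketch shows one way a single speaker utterance could be stored with several tiered listener motions, and how preference pairs could be derived from the tiers. It is not the authors' data format: the class names, field names, the `preference_pairs` helper, the numeric ranks, and the example transcript are all hypothetical; only the gold/silver/negative tier labels come from the dataset description.

```python
from dataclasses import dataclass, field
from itertools import product

# Tier labels follow the gold/silver/negative naming used for ReactMotionNet;
# the numeric ranks are an assumption made for this sketch.
TIER_RANK = {"gold": 2, "silver": 1, "negative": 0}


@dataclass
class Candidate:
    motion_id: str  # hypothetical identifier of a listener motion clip
    tier: str       # one of the keys of TIER_RANK


@dataclass
class Sample:
    transcript: str                                  # speaker utterance (text only here)
    candidates: list = field(default_factory=list)   # multiple listener motions


def preference_pairs(sample):
    """Return (preferred, rejected) motion-id pairs implied by the tiers."""
    pairs = []
    for a, b in product(sample.candidates, repeat=2):
        # A candidate is preferred over another only if its tier ranks strictly higher.
        if TIER_RANK[a.tier] > TIER_RANK[b.tier]:
            pairs.append((a.motion_id, b.motion_id))
    return pairs


# Made-up example: one utterance with three candidate reactions.
sample = Sample(
    transcript="I finally got the job!",
    candidates=[
        Candidate("m1", "gold"),
        Candidate("m2", "silver"),
        Candidate("m3", "negative"),
    ],
)
pairs = preference_pairs(sample)
print(pairs)
```

Pairs of this kind are the sort of input a preference-based training objective or a preference-oriented evaluation could consume; the actual objectives and protocols used by ReactMotion are not specified in this excerpt.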

Paper Structure (60 sections, 46 equations, 8 figures, 10 tables)

Introduction
Contributions.
Related Work
Task Definition
ReactMotionNet Dataset
Dataset Construction Pipeline
Step 1: Dyadic Listener Reactive Motion Curation.
Step 2: Inverse Speaker-Condition Synthesis.
Step 3: Data Filtering.
Step 4: Speaker–Listener Candidate Ranking and Preference Tiering.
Dataset Statistics
Methodology
Modality-Specific Tokenization
Audio Tokenization.
Motion Tokenization.
...and 45 more sections

Figures (8)

Figure 1: Illustration of the proposed new task: Reactive Listener Motion Generation from Speaker Utterance. Given a speaker’s utterance, i.e., transcript and/or audio (optionally supplemented with emotion), a generative model such as our ReactMotion generates a corresponding responsive body-motion sequence for the listener.
Figure 2: ReactMotionNet dataset construction. We curate dyadic listener motions (Step 1), synthesize speaker conditions via inverse inference and Text-to-Speech (TTS) (Step 2), filter unreliable samples (Step 3), and rank/re-tier speaker–listener pairs into gold/silver/negative preferences (Step 4).
Figure 3: Overview of the ReactMotion framework. We use modality-specific tokenizers to convert raw data, i.e., the speaker’s utterances (including transcript, audio, and emotion) and the listener’s reactive motions, into discrete special tokens. On top of these tokenizers, a Seq2Seq model integrates information across modalities and learns to generate the listener’s reactive motions from the speaker’s utterances.
Figure 4: Qualitative results. We compare gold and silver listener reactions, motions generated by our ReactMotion (Ours), a cross-entropy-trained variant (CE), and a cascaded LLM→T2M baseline, all conditioned on the same speaker utterance. We visualize the resulting 3D motion sequences.
Figure 5: User study on reactive appropriateness.
...and 3 more figures
