ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data

Yaoqin Ye, Yiteng Xu, Qin Sun, Xinge Zhu, Yujing Sun, Yuexin Ma

Abstract

Human behaviors in real-world environments are inherently interactive, with an individual's motion shaped by surrounding agents and the scene. Modeling such reactive behavior is essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation, which synthesizes the ego agent's future motion from dynamic multi-source cues, including others' actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen performs segment-level generation together with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions while generalizing effectively across diverse interaction scenarios.
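To make the two mechanisms in the abstract concrete, the following is a minimal, hypothetical sketch rather than the authors' implementation: every name here (UniversalMotionPrior, MetaInteractionModule, FrameRefiner), all tensor shapes, and the choice of cross-attention adapters over a frozen transformer prior are assumptions, used only to illustrate how a frozen single-person motion prior could be combined with independently trained per-domain adapters and a lightweight frame-wise refinement loop.

```python
# Hypothetical sketch of ReMoGen's modular design; names, shapes, and the
# cross-attention adapter choice are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class UniversalMotionPrior(nn.Module):
    """Stand-in for a motion prior pretrained on large single-person datasets."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, motion_tokens, cond=None):
        x = motion_tokens if cond is None else motion_tokens + cond
        return self.backbone(x)


class MetaInteractionModule(nn.Module):
    """Adapter trained independently per interaction domain (human-human, human-scene, ...)."""

    def __init__(self, d_model: int = 256, d_cue: int = 128):
        super().__init__()
        self.cue_proj = nn.Linear(d_cue, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, ego_tokens, cue_tokens):
        cues = self.cue_proj(cue_tokens)
        out, _ = self.cross_attn(ego_tokens, cues, cues)
        return out  # conditioning signal injected into the frozen prior


class FrameRefiner(nn.Module):
    """Lightweight per-frame correction from newly observed cues (illustrative)."""

    def __init__(self, d_model: int = 256, d_cue: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model + d_cue, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, frame, new_cue):
        return frame + self.mlp(torch.cat([frame, new_cue], dim=-1))  # residual update


prior = UniversalMotionPrior()
for p in prior.parameters():
    p.requires_grad_(False)  # the prior stays frozen; only adapters/refiner are trained

hh_adapter = MetaInteractionModule()  # conditions on the partner's motion
hs_adapter = MetaInteractionModule()  # conditions on scene geometry
refiner = FrameRefiner()


def generate_segment(ego_hist, partner_cues, scene_cues):
    cond = hh_adapter(ego_hist, partner_cues) + hs_adapter(ego_hist, scene_cues)
    return prior(ego_hist, cond)  # next motion segment, generated one chunk at a time


with torch.no_grad():
    ego_hist = torch.randn(1, 16, 256)   # (batch, frames, motion features)
    partner = torch.randn(1, 16, 128)    # partner-motion cue tokens
    scene = torch.randn(1, 32, 128)      # scene-geometry cue tokens
    segment = generate_segment(ego_hist, partner, scene)
    for t in range(segment.shape[1]):    # frame-wise refinement as cues arrive
        new_cue = torch.randn(1, 128)    # placeholder for a newly observed cue
        segment[:, t] = refiner(segment[:, t], new_cue)
```

Under this reading, supporting a new interaction domain amounts to training one more adapter against the frozen prior, and the per-frame refiner keeps an already-generated segment aligned with cues that arrive after the segment was produced.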


Figures (11)

  • Figure 2: Overview of the ReMoGen Framework. Our framework is designed to address the challenges of data scarcity and real-time responsiveness in interaction-to-reaction generation.
  • Figure 3: Qualitative results of ReMoGen across Human–Human, Human–Scene, and mixed Human–Human–Scene scenarios. Blue meshes denote the generated ego motion, while red meshes represent the observed motions of others. The examples cover diverse interaction behaviors, including Taichi-style movements, chasing, chatting, and scene-aware interaction, demonstrating the versatility of ReMoGen across heterogeneous interaction settings.
  • Figure A: Architecture of Meta-Interaction Block.
  • Figure B: Results under no-text and shuffled-text settings. In the no-text setting, the original intent is "Shake hands". In the shuffled-text setting, the original intent is "Help others up" while the input is "Standing still".
  • Figure C: Qualitative comparisons on Human–Human Interaction tasks. For presentation, we show the optimal offline version of FreeMotion. Our method produces smoother and more coordinated reactions aligned with the intent, whereas baselines exhibit unnatural timing, misaligned contact, or unstable body dynamics.
  • ...and 6 more figures