ReactionMamba: Generating Short & Long Human Reaction Sequences

Hajra Anwar Beg, Baptiste Chopin, Hao Tang, Mohamed Daoudi

TL;DR

ReactionMamba introduces a VAE-based framework fused with Mamba selective state-space models to generate long-horizon, two-person reaction motions conditioned on a given actor action. The encoder maps the reactor's pose sequence to latent variables, while the decoder uses these latent codes together with the action sequence and initial pose to reconstruct coherent reaction sequences, enabling scalable, real-time generation. Across Lindy Hop, NTU120-AS, and InterX, the method delivers competitive realism and diversity with orders-of-magnitude faster inference than transformer-based baselines, and demonstrates robustness in long-horizon scenarios. Ablation studies confirm the value of direct initial-pose and action conditioning, while limitations point to adaptive conditioning and enhanced foot-ground realism as future directions.

Abstract

We present ReactionMamba, a novel framework for generating long 3D human reaction motions. ReactionMamba integrates a motion VAE for efficient motion encoding with Mamba-based state-space models to decode temporally consistent reactions. This design enables ReactionMamba to generate both short sequences of simple motions and long sequences of complex motions, such as dance and martial arts. We evaluate ReactionMamba on three datasets (NTU120-AS, Lindy Hop, and InterX) and demonstrate competitive performance in terms of realism, diversity, and long-sequence generation compared to previous methods, including InterFormer, ReMoS, and Ready-to-React, while achieving substantial improvements in inference speed.

Paper Structure

This paper contains 20 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Examples of reactive 3D motion generated with our proposed model for the "Cheers and drink" class of the NTU120-AS dataset. Here we synthesize the 3D pose of the reactor (green) conditioned on the 3D motion of the actor (blue) and the initial pose of the reaction.
  • Figure 2: Architecture of ReactionMamba. The Mamba encoder encodes the motion of the reaction $\mathbf{Y}$ and maps it to a latent representation $\mathbf{Z}$. The decoder network takes as input the concatenation of the latent vector $\mathbf{Z}$, a projection of the action motion sequence $\mathbf{X}$, and the initial pose $Y_1$, and produces the reconstructed reaction motion sequence $\hat{\mathbf{Y}}$.
  • Figure 3: Visualization of a sequence generated on Lindy Hop (1000 frames). The action motion used as the condition is shown in blue, the ground-truth reaction in red, and the reactions generated by the different models in other colors. See the supplementary material for the corresponding animations.
  • Figure 4: Visualization of a sequence generated on the NTU120-AS "push" class. The action motion used as the condition is shown in blue, the ground-truth reaction in red, and the reactions generated by the different models in other colors. See the supplementary material for the corresponding animations.
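
The data flow described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the toy sizes, the random weights standing in for learned projections, the `toy_ssm` layer (a fixed linear recurrence used as a placeholder, not Mamba's selective scan), the use of the last hidden state as the sequence summary, and the choice to broadcast $\mathbf{Z}$ and $Y_1$ across time and concatenate them with the projected action along the feature axis. The caption says only that the three inputs are concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: frames, pose dimension, latent dimension, hidden width.
T, D_POSE, D_LATENT, D_HIDDEN = 16, 6, 4, 8


def linear(d_in, d_out):
    """Random matrix standing in for a learned projection (assumption)."""
    return rng.normal(scale=0.1, size=(d_in, d_out))


def toy_ssm(u, decay=0.9):
    """Placeholder sequence layer: h_t = decay * h_{t-1} + u_t.

    This is NOT Mamba: a real selective state-space layer makes its
    parameters depend on the input. It only marks where such a layer sits.
    """
    h = np.zeros(u.shape[1])
    out = np.empty_like(u)
    for t in range(u.shape[0]):
        h = decay * h + u[t]
        out[t] = h
    return out


W_enc = linear(D_POSE, D_HIDDEN)
W_mu = linear(D_HIDDEN, D_LATENT)
W_logvar = linear(D_HIDDEN, D_LATENT)
W_act = linear(D_POSE, D_HIDDEN)
W_dec = linear(D_LATENT + D_HIDDEN + D_POSE, D_HIDDEN)
W_out = linear(D_HIDDEN, D_POSE)


def encode(Y):
    """Map the reaction sequence Y (T, D_POSE) to Gaussian latent parameters."""
    summary = toy_ssm(Y @ W_enc)[-1]  # last state summarises the sequence (assumption)
    return summary @ W_mu, summary @ W_logvar


def reparameterize(mu, logvar):
    """Standard VAE reparameterization: Z = mu + sigma * eps."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)


def decode(Z, X, y1):
    """Decode a reaction from latent Z, action sequence X, and initial pose y1."""
    n_frames = X.shape[0]
    cond = np.concatenate(
        [np.tile(Z, (n_frames, 1)),    # latent, broadcast over time (assumption)
         X @ W_act,                    # per-frame projection of the action
         np.tile(y1, (n_frames, 1))],  # initial pose, broadcast over time (assumption)
        axis=1,
    )
    return toy_ssm(cond @ W_dec) @ W_out


X = rng.normal(size=(T, D_POSE))  # actor motion (random stand-in data)
Y = rng.normal(size=(T, D_POSE))  # reactor motion (random stand-in data)

mu, logvar = encode(Y)
Z = reparameterize(mu, logvar)
Y_hat = decode(Z, X, Y[0])        # reconstruction path from Figure 2
print(Y_hat.shape)
```

In a standard VAE, generation without a ground-truth reaction would replace `Z` with a sample from the prior, for example `decode(rng.standard_normal(D_LATENT), X, y1)`; whether ReactionMamba samples in exactly this way is not stated in this summary.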