ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions

Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek

TL;DR

ReMoS tackles the problem of reactive, two-person 3D motion synthesis with full-body and finger articulation by learning a conditional distribution $P(X|Y)$ through a cascaded diffusion framework. It introduces a novel two-stage generation (body then hands) with a combined spatio-temporal cross-attention (CoST-XA) and a hand-interaction-aware cross-attention (H-XA), along with a distance-aware reaction loss and inference-time spatial guidance. The authors also contribute the ReMoCap dataset, featuring Lindy Hop and Ninjutsu with finger-level data, enabling realistic inter-person interactions. Quantitative and user studies show state-of-the-art performance on multiple datasets and demonstrate practical motion-editing applications such as pose completion and in-betweening, advancing animation and interactive robotics. Overall, ReMoS provides annotation-free, diffusion-based reactive motion synthesis with strong inter-person coordination and fine-grained hand dynamics, suitable for immersive character animation pipelines.

Abstract

Current approaches for 3D human motion synthesis generate high-quality animations of digital humans performing a wide variety of actions and gestures. However, a notable technological gap exists in addressing the complex dynamics of multi-human interactions within this paradigm. In this work, we present ReMoS, a denoising diffusion-based model that synthesizes full-body reactive motion of a person in a two-person interaction scenario. Given the motion of one person, we employ a combined spatio-temporal cross-attention mechanism to synthesize the reactive body and hand motion of the second person, thereby completing the interactions between the two. We demonstrate ReMoS across challenging two-person scenarios such as pair dancing, Ninjutsu, kickboxing, and acrobatics, where one person's movements have complex and diverse influences on the other. We also contribute the ReMoCap dataset for two-person interactions containing full-body and finger motions. We evaluate ReMoS through multiple quantitative metrics, qualitative visualizations, and a user study, and also indicate usability in interactive motion-editing applications.
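To make the conditioning idea concrete, the sketch below shows generic scaled dot-product cross-attention in which the reactor's features act as queries and the actor's features act as keys and values. It is not the CoST-XA module itself: the function name `cross_attention`, the feature shapes, and the absence of learned projection matrices are all simplifying assumptions made for illustration.

```python
import numpy as np

def cross_attention(reactor_feats, actor_feats):
    """Generic cross-attention (illustrative, no learned projections).

    reactor_feats: (N_r, d) array used as queries.
    actor_feats:   (N_a, d) array used as both keys and values.
    Returns the attended features (N_r, d) and the attention weights (N_r, N_a).
    """
    d = reactor_feats.shape[-1]
    # Similarity of every reactor token to every actor token.
    scores = reactor_feats @ actor_feats.T / np.sqrt(d)
    # Numerically stable softmax over the actor axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each reactor token becomes a weighted mix of actor features.
    return weights @ actor_feats, weights

# Hypothetical sizes: 4 reactor tokens, 6 actor tokens, 8 feature channels.
rng = np.random.default_rng(0)
attended, attn = cross_attention(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
```

In this sketch a token could stand for a frame, a joint, or a frame-joint pair; how tokens are actually formed across space and time is a design decision of the paper that this summary does not spell out.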

Paper Structure (49 sections, 14 equations, 7 figures, 7 tables)

This paper contains 49 sections, 14 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Visualizations of reactive 3D motion sequences synthesized with the proposed ReMoS approach. We synthesize the 3D full-body motion of the reactor (blue) conditioned only on the 3D motion of the actor (red), thereby completing the interactions between the two (Ninjutsu practice on the left and Lindy Hop dancing on the right). The synthesized hand interactions are enlarged and highlighted with circles.
  • Figure 2: ReMoS Overview. Given the motion of the actor (bottom-middle, in red), we synthesize a plausible motion for the reactor (bottom-left, in blue). We achieve this using a denoising diffusion-based probabilistic model (center) trained on reactive motion sequences (top-left, in blue).
  • Figure 3: ReMoS Framework. Given the full-body sequence of the actor (left, in red), we input noisy body and hand samples (from below) in a cascaded fashion. We synthesize the body samples first, and use them for hand-interaction-aware attention masking (top-center) to synthesize the denoised hand samples (top-right). The full-body reactive motion is a concatenation of the denoised body and hand samples (right, in blue).
  • Figure 4: Visualization of Distance-Aware Reaction Loss. We use an exponentially decaying distance-aware reaction loss to focus more on the reactor’s joints that are closer to the actor.
  • Figure 5: Qualitative Results and Applications. We show some visual results and the application of ReMoS as a motion editing tool. (a) The reactor (in blue) synthesized by ReMoS has the most plausible alignment with the actor (in red) compared to the baselines. (b) We manually control the right-hand wrist joint of the reactor and let ReMoS synthesize the remaining body joints conditioned on the actor. (c) ReMoS synthesizes the reactor's motion in-between the start and end frames.
  • ...and 2 more figures
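
The distance-aware reaction loss described in the Figure 4 caption can be sketched as a per-joint reconstruction error scaled by a weight that decays exponentially with the joint's distance to the actor. The exact formulation is not given in this summary, so the following is only one plausible reading: the nearest-actor-joint distance, the `decay` parameter, the squared-error term, and both function names are assumptions.

```python
import numpy as np

def distance_aware_weights(reactor, actor, decay=1.0):
    """Per-joint weights that shrink exponentially with distance to the actor.

    reactor: (T, J_r, 3) ground-truth reactor joint positions.
    actor:   (T, J_a, 3) actor joint positions.
    decay:   hypothetical decay rate (assumption, not from the paper).
    """
    # Pairwise offsets between every reactor joint and every actor joint.
    diff = reactor[:, :, None, :] - actor[:, None, :, :]   # (T, J_r, J_a, 3)
    dist = np.linalg.norm(diff, axis=-1)                   # (T, J_r, J_a)
    nearest = dist.min(axis=-1)                            # (T, J_r)
    return np.exp(-decay * nearest)

def distance_aware_reaction_loss(pred, target, actor, decay=1.0):
    """Squared joint error, weighted so joints near the actor count more."""
    weights = distance_aware_weights(target, actor, decay)
    sq_err = np.sum((pred - target) ** 2, axis=-1)         # (T, J_r)
    return float(np.mean(weights * sq_err))

# Toy example: one frame, one actor joint at the origin, two reactor joints.
actor = np.zeros((1, 1, 3))
target = np.array([[[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]])
print(distance_aware_weights(target, actor))
```

With weights of this form, an error on a joint in contact with the actor is penalized at full strength, while the same error on a distant joint contributes almost nothing, which matches the caption's stated intent of focusing on the reactor's joints that are closer to the actor.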