Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation

Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, Ruizhen Hu

TL;DR

This paper introduces Ready-to-React, an online reaction policy that lets two independently acting characters interact in real time. It combines a VQ-VAE-based latent space with a transformer-conditioned diffusion predictor and an online decoder to generate each next pose in a streaming fashion from observed motion histories, which mitigates error accumulation. Evaluated on a boxing dataset (DuoBox), the method outperforms baselines in both reactive and two-character motion generation, including on long sequences, and supports control from sparse signals for VR applications. The approach advances online, interactive motion generation, with practical implications for robotics, gaming, and immersive environments.

Abstract

This paper addresses the task of generating two-character online interactions. Previously, two main settings existed for two-character interaction generation: (1) generating one's motions based on the counterpart's complete motion sequence, and (2) jointly generating two-character motions based on specific conditions. We argue that these settings fail to model the process of real-life two-character interactions, where humans will react to their counterparts in real time and act as independent individuals. In contrast, we propose an online reaction policy, called Ready-to-React, to generate the next character pose based on past observed motions. Each character has its own reaction policy as its "brain", enabling them to interact like real humans in a streaming manner. Our policy is implemented by incorporating a diffusion head into an auto-regressive model, which can dynamically respond to the counterpart's motions while effectively mitigating the error accumulation throughout the generation process. We conduct comprehensive experiments using the challenging boxing task. Experimental results demonstrate that our method outperforms existing baselines and can generate extended motion sequences. Additionally, we show that our approach can be controlled by sparse signals, making it well-suited for VR and other online interactive environments.
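
The streaming setup described in the abstract, where each character runs its own policy and only sees past motion, can be illustrated with a small sketch. This is not the paper's model: `toy_policy`, the `Pose` type, and the `window` parameter are hypothetical stand-ins, and the real policy (an auto-regressive model with a diffusion head) is replaced by a trivial rule so that the loop structure is easy to see.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical pose representation: a flat tuple of floats.
Pose = Tuple[float, ...]
# A policy maps (own history, counterpart history) to the next pose.
Policy = Callable[[Sequence[Pose], Sequence[Pose]], Pose]


def rollout(policy_a: Policy, policy_b: Policy,
            init_a: Pose, init_b: Pose,
            num_frames: int, window: int = 4) -> Tuple[List[Pose], List[Pose]]:
    """Generate two motion streams frame by frame.

    At every step each agent observes only past poses (its own and the
    counterpart's); neither sees the other's upcoming pose.
    """
    hist_a: List[Pose] = [init_a]
    hist_b: List[Pose] = [init_b]
    for _ in range(num_frames):
        obs_a = hist_a[-window:]
        obs_b = hist_b[-window:]
        # Both decisions are made from the same snapshot of the past.
        next_a = policy_a(obs_a, obs_b)
        next_b = policy_b(obs_b, obs_a)
        hist_a.append(next_a)
        hist_b.append(next_b)
    return hist_a, hist_b


def toy_policy(own: Sequence[Pose], other: Sequence[Pose]) -> Pose:
    """Placeholder policy (assumption): move 10% toward the counterpart's last pose."""
    return tuple(o + 0.1 * (c - o) for o, c in zip(own[-1], other[-1]))


if __name__ == "__main__":
    a, b = rollout(toy_policy, toy_policy, (0.0,), (1.0,), num_frames=3)
    print(a)
    print(b)
```

Because the loop has no fixed horizon, the sequence length is limited only by how well the policy resists drift, which is the error-accumulation issue the abstract refers to.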

Paper Structure

This paper contains 13 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Demonstration of Ready-to-React, an online reaction policy for two-character interaction generation on the challenging task of boxing. Ready-to-React predicts the next pose of an agent by considering its own and the counterpart's historical motions. Our method can successfully generate 1800 frames of motion, whereas the GPT-based approach struggles after about 200 frames, displaying issues such as incorrect orientation, leaving the ring boundary, or freezing in place due to the accumulation of errors over time.
  • Figure 2: Pipeline overview. The leftmost panel shows a boxing scene in which the blue agent is deciding its next move. The reaction policy (Section \ref{sec:policy}) proceeds in three steps: first, the history encoder encodes the current state and observations; then, the next latent predictor predicts the upcoming motion latent; and finally, an online motion decoder decodes this motion latent into the actual next pose. The same reaction policy can be applied to the pink agent. By running this streaming process for both agents, our reaction policy enables the continuous generation of two-character motion sequences without a length limit (Section \ref{sec:twochar}).
  • Figure 3: Visualization of the face direction relative to time. We compare our method with baselines in two scenarios described in Section \ref{sec:baseline}. The x-axis represents the frame number f, while the y-axis shows the angle between the facing directions of the two characters (in degrees). An angle of $0^{\circ}$ indicates that the two agents are facing each other, whereas $\pm 180^{\circ}$ means they are facing away from each other. The green lines represent the ground truth, the blue lines represent our method, and the red lines represent the baselines.
  • Figure 4: Qualitative results of generating reactive motions from sparse signals. We compare our method with CAMDM. Our approach successfully generates realistic motion while effectively adhering to the sparse signals (annotated by red dots in the figures). In contrast, CAMDM struggles to achieve the same level of responsiveness and accuracy, as shown in the red circles.
  • Figure 5: Qualitative results of generating reactive motions. Given the same ground truth opponent motion, InterFormer can produce reactive motion that is too close to the opponent, leading to penetration. CAMDM tends to get stuck, while Duolando may result in human motion with incorrect orientation after a certain period.
  • ...and 1 more figure
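
The Figure 3 caption does not spell out how the facing angle is computed. One plausible reading, shown here purely as an assumption, is a signed angle on the ground plane between an agent's facing direction and the direction toward its counterpart, so that $0^{\circ}$ means the agent looks straight at the opponent and $\pm 180^{\circ}$ means its back is turned. The function name `facing_angle_deg` and the 2D (x, y) ground-plane inputs are hypothetical.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def facing_angle_deg(pos_a: Vec2, facing_a: Vec2, pos_b: Vec2) -> float:
    """Signed angle in degrees from A's facing direction to the A->B direction.

    Assumed convention (not taken from the paper): 0 means A looks
    directly at B, +/-180 means A faces directly away from B.
    """
    to_b = (pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])
    # 2D cross and dot products give the sine and cosine of the signed angle
    # (up to a common positive scale, which atan2 ignores).
    cross = facing_a[0] * to_b[1] - facing_a[1] * to_b[0]
    dot = facing_a[0] * to_b[0] + facing_a[1] * to_b[1]
    return math.degrees(math.atan2(cross, dot))


if __name__ == "__main__":
    print(facing_angle_deg((0.0, 0.0), (1.0, 0.0), (2.0, 0.0)))
    print(facing_angle_deg((0.0, 0.0), (1.0, 0.0), (0.0, 2.0)))
```

Plotting this value per frame for a generated sequence would give a curve comparable in spirit to the ones described in the caption, where a drift away from $0^{\circ}$ signals an agent losing track of its opponent.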