HERO: Human Reaction Generation from Videos

Chengjun Yu, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha

TL;DR

The paper tackles 3D human reaction generation from RGB videos by introducing HERO, a framework that proactively extracts interaction intention from video representations to guide reactive motion synthesis. It combines a TC-CLIP-based video encoder, a Motion VQ-VAE for discrete motion tokens, and a reaction generation module with masked motion modeling, global-local cross-attention, and intention-conditioned guidance. The authors also present ViMo, a large dataset of 3,500 video-motion pairs across human-human, animal-human, and scene-human interactions to support this task. Experiments show HERO outperforms baselines on FID, diversity, and multimodality, with qualitative and user-study evidence of improved plausibility and quality. This work broadens interactive AI capabilities by enabling emotion-aware, multi-category reaction generation from unconstrained video inputs, with practical implications for embodied AI and interactive systems.
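
As a rough illustration of what "discrete motion tokens" means, the sketch below shows the nearest-codebook lookup at the core of any VQ-VAE. It is not HERO's implementation: the codebook values, the 2-D latent size, and the function names (`quantize`, `dequantize`) are assumptions made for this example. In a real model, a learned encoder produces the latents and a learned decoder maps the codes back to poses.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        distances = [squared_distance(z, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each token index."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy codebook with three 2-D codes (real codebooks are learned).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
# Hypothetical encoder outputs, one latent per (downsampled) motion frame.
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.6, 0.6]]

tokens = quantize(latents, codebook)
reconstructed = dequantize(tokens, codebook)
print(tokens)  # [0, 1, 2, 1]
```

The resulting index sequence is the kind of discrete representation that a masked-modeling generator can then predict token by token.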

Abstract

Human reaction generation represents a significant research domain for interactive AI, as humans constantly interact with their surroundings. Previous works focus mainly on synthesizing the reactive motion given a human motion sequence. This paradigm limits interaction categories to human-human interactions and ignores emotions that may influence reaction generation. In this work, we propose to generate 3D human reactions from RGB videos, which involves a wider range of interaction categories and naturally provides information about expressions that may reflect the subject's emotions. To cope with this task, we present HERO, a simple yet powerful framework for Human rEaction geneRation from videOs. HERO considers both global and frame-level local representations of the video to extract the interaction intention, and then uses the extracted interaction intention to guide the synthesis of the reaction. Besides, local visual representations are continuously injected into the model to maximize the exploitation of the dynamic properties inherent in videos. Furthermore, the ViMo dataset containing paired Video-Motion data is collected to support the task. In addition to human-human interactions, these video-motion pairs also cover animal-human interactions and scene-human interactions. Extensive experiments demonstrate the superiority of our methodology. The code and dataset will be publicly available at https://jackyu6.github.io/HERO.
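
The abstract says that frame-level local visual representations are continuously injected into the model. One common way to do this is cross-attention, where motion-token embeddings act as queries over per-frame video features. The sketch below is generic scaled dot-product attention under that assumption. It omits the learned projections, multiple heads, and the global branch, and the names and toy values are hypothetical rather than taken from HERO.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Sequence[Sequence[float]],
                    keys: Sequence[Sequence[float]],
                    values: Sequence[Sequence[float]]) -> List[List[float]]:
    """For each query, return an attention-weighted average of the values."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this motion query to every video frame feature.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Mix the frame values according to the attention weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


# Hypothetical toy inputs: one motion-token query, two frame features.
motion_queries = [[1.0, 0.0]]
frame_keys = [[0.0, 0.0], [0.0, 0.0]]    # identical keys give uniform weights
frame_values = [[2.0, 0.0], [4.0, 2.0]]

attended = cross_attention(motion_queries, frame_keys, frame_values)
```

With identical keys the attention weights are uniform, so the output is simply the mean of the frame values. With distinct keys, frames more similar to the query would contribute more.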

Paper Structure

This paper contains 18 sections, 4 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Overview of our work. We propose to generate 3D human reactions from RGB videos. To tackle this task, a simple yet powerful framework, HERO, is presented. Furthermore, to facilitate research in this area, we introduce the ViMo dataset, which features a wide range of interaction categories covering three broad ones: human-human interactions, animal-human interactions, and scene-human interactions. For the human reactions visualized in the figure, darker colors indicate later points in time.
  • Figure 2: The pipeline of HERO. During training, the video and GT reactive motion are input into HERO. As for inference, only the video is provided. Note that we omit the residual motion refinement (see the end of Sec. 3.3) from the figure for clarity.
  • Figure 3: The ViMo dataset contains 32 subcategorized interactions, each belonging to one of three broad categories: human-human interactions, animal-human interactions, and scene-human interactions. Among them, human-human interactions cover daily socializing, sports, physical confrontations, and others.
  • Figure 4: User study results. The higher the scores, the better.
  • Figure 5: Visual comparisons between the different methods given three distinct videos from the ViMo test set.
  • ...and 5 more figures