ReGenNet: Towards Human Action-Reaction Synthesis

Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, Wenjun Zeng

TL;DR

ReGenNet, a diffusion-based generative model with a Transformer decoder architecture, is proposed together with an explicit distance-based interaction loss to predict human reactions in an online manner, where the future states of actors are unavailable to reactors.

Abstract

Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects, while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic interaction periods. In this paper, we comprehensively analyze the asymmetric, dynamic, synchronous, and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with, we propose to annotate the actor-reactor order of the interaction sequences for the NTU120, InterHuman, and Chi3D datasets. Based on them, a diffusion-based generative model with a Transformer decoder architecture called ReGenNet together with an explicit distance-based interaction loss is proposed to predict human reactions in an online manner, where the future states of actors are unavailable to reactors. Quantitative and qualitative results show that our method can generate instant and plausible human reactions compared to the baselines, and can generalize to unseen actor motions and viewpoint changes.
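The abstract mentions an explicit distance-based interaction loss but does not give its formula here. As a rough illustration only, the sketch below assumes one plausible form: the mean squared difference between actor-to-reactor joint distances computed with the predicted reactor and with the ground-truth reactor. The function names (`pairwise_distances`, `interaction_loss`), the `(frames, joints, 3)` array layout, and the squared-error choice are all assumptions for illustration, not the authors' definition.

```python
import numpy as np


def pairwise_distances(actor, reactor):
    """Distances between every actor joint and every reactor joint, per frame.

    actor, reactor: arrays of shape (frames, joints, 3) (assumed layout).
    Returns an array of shape (frames, actor_joints, reactor_joints).
    """
    diff = actor[:, :, None, :] - reactor[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)


def interaction_loss(actor, reactor_pred, reactor_true):
    """Hypothetical distance-based loss (an assumption, not the paper's formula).

    Penalizes predicted reactions whose joint-to-joint distances to the
    actor differ from those of the ground-truth reaction.
    """
    d_pred = pairwise_distances(actor, reactor_pred)
    d_true = pairwise_distances(actor, reactor_true)
    return float(np.mean((d_pred - d_true) ** 2))


# Toy usage: 2 frames, 3 joints; the prediction is offset from the ground truth.
actor = np.zeros((2, 3, 3))
reactor_true = np.ones((2, 3, 3))
reactor_pred = reactor_true + 1.0
loss = interaction_loss(actor, reactor_pred, reactor_true)
```

A loss of this kind depends only on the relative geometry between the two people, so it is zero whenever the predicted reactor matches the ground truth and grows as the predicted body drifts toward or away from the actor.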

Paper Structure (20 sections, 4 equations, 3 figures, 11 tables)

Figures (3)

  • Figure 1: Illustration of our proposed ReGenNet: given a human motion sequence, it generates plausible human reactions, which has broad applications in AR/VR and games.
  • Figure 2: The architecture of our proposed ReGenNet, which is formulated as a diffusion-based framework with Transformer Decoder Units. In (a), the gray panel illustrates the whole diffusion model, with the "Forward Diffusion" process and a stack of $\ell_{dec}$ "Transformer Decoder Units" as the denoising process; the blue panel shows the actor feature used as the condition. (b) shows the details of the "Transformer Decoder Units" with a directional attention mask for online reaction synthesis.
  • Figure 3: Visualization of human action-reaction synthesis results. Blue for actors and Orange for reactors.
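
The Figure 2 caption refers to a directional attention mask for online reaction synthesis, where reactors cannot see the actor's future states. The sketch below is a minimal illustration under the assumption that this mask behaves like a standard causal (lower-triangular) mask, so the reactor at frame t attends only to actor frames up to t. The helper names (`directional_mask`, `masked_attention`) and the single-head, projection-free attention are simplifications for illustration and are not taken from the paper.

```python
import numpy as np


def directional_mask(num_frames):
    """Boolean mask, True where reactor frame i may attend to actor frame j.

    Assumed to be lower-triangular: only frames j <= i are visible.
    """
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))


def masked_attention(queries, keys, values, allowed):
    """Single-head scaled dot-product attention with a boolean mask.

    queries: (frames, dim) reactor features; keys, values: (frames, dim)
    actor features. Disallowed positions receive zero attention weight.
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


# Toy usage: 4 frames with 8-dimensional features.
rng = np.random.default_rng(0)
reactor_feat = rng.normal(size=(4, 8))
actor_feat = rng.normal(size=(4, 8))
mask = directional_mask(4)
out, weights = masked_attention(reactor_feat, actor_feat, actor_feat, mask)

# Changing only the last actor frame leaves earlier outputs untouched.
actor_future_changed = actor_feat.copy()
actor_future_changed[-1] += 10.0
out_changed, _ = masked_attention(
    reactor_feat, actor_future_changed, actor_future_changed, mask
)
```

With this kind of mask, the outputs for earlier frames are unaffected by later actor frames, which is the property an online setting needs.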