ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

Zichen Geng, Zeeshan Hayder, Wei Liu, Hesheng Wang, Ajmal Mian

TL;DR

This work introduces ARMFlow, a MeanFlow-based autoregressive framework for real-time 3D human reaction generation. It features a causal context encoder and Bootstrap Contextual Encoding (BSCE), which together maintain long-term semantic coherence while limiting error accumulation. It pairs online generation (ARMFlow) with an offline variant (ReMFlow) built on a DiT backbone and a CNN-VAE for motion compression, achieving single-step inference and state-of-the-art results on both fidelity and semantic alignment. Experiments on InterHuman and InterX show that ARMFlow surpasses online baselines in FID and R-Precision, while ReMFlow delivers the fastest offline inference with competitive quality. The work targets practical reaction generation for HRI, AR, and VR in both online and offline settings, and identifies future directions such as handling elastic delays and post-hoc guidance.

Abstract

3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, which encodes the generated history instead of the ground-truth history, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, which achieves state-of-the-art performance with the fastest inference among offline methods. ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency with single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
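To make the single-step autoregressive idea concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the MeanFlow-style one-step rule z_r = z_t - (t - r) * u(z_t, r, t) with r = 0 and t = 1, under the linear path convention z_t = (1 - t) x + t * eps. The names `encode_context`, `mean_velocity`, `one_step_sample`, and `rollout` are hypothetical, and the "network" is a closed-form toy rather than the DiT context encoder and MLP velocity predictor described in the paper. What it illustrates is the control flow: at each step the context sees only the actor frames observed so far plus the previously generated reactor frames, and one network evaluation produces the next reactor frame.

```python
import numpy as np

def encode_context(actor_seen, reactor_hist):
    """Hypothetical causal context encoder (toy stand-in).

    Uses only actor frames observed so far and previously *generated*
    reactor frames, never future frames or ground-truth reactor motion.
    """
    parts = [actor_seen.mean(axis=0)]
    if len(reactor_hist) > 0:
        parts.append(np.mean(reactor_hist, axis=0))
    return np.mean(parts, axis=0)

def mean_velocity(z_t, r, t, context):
    """Toy average velocity u(z_t, r, t | context) (assumption, not learned).

    For a straight path from a data point x to noise eps, the velocity
    eps - x is constant and equals (z_t - x) / t. Here the context plays
    the role of x.
    """
    return (z_t - context) / t

def one_step_sample(z1, context):
    """One-step sampling: z_r = z_t - (t - r) * u(z_t, r, t), r=0, t=1."""
    r, t = 0.0, 1.0
    return z1 - (t - r) * mean_velocity(z1, r, t, context)

def rollout(actor, rng):
    """Generate one reactor frame per incoming actor frame."""
    reactor_hist = []
    for k in range(actor.shape[0]):
        ctx = encode_context(actor[: k + 1], reactor_hist)  # causal view
        z1 = rng.standard_normal(actor.shape[1])            # fresh noise
        reactor_hist.append(one_step_sample(z1, ctx))       # single evaluation
    return np.stack(reactor_hist)

actor = np.random.default_rng(0).standard_normal((6, 4))  # 6 frames, 4-dim latents
reactor = rollout(actor, np.random.default_rng(1))
print(reactor.shape)
```

Because the history fed to `encode_context` is the model's own output, this loop also mirrors the condition that BSCE is described as training under: the encoder is exposed to generated history rather than ground-truth history.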

Paper Structure

This paper contains 21 sections, 5 equations, 3 figures, 6 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Our method performs only a single inference pass per real-time step for online reaction generation, compared to the SOTA methods ReGenNet (35-78 ms) and CAMDM (45 ms). The text description is from the InterX dataset.
  • Figure 2: Overview of the proposed architecture for online and offline reaction generation. The framework uses a CNN-based encoder to learn a compact latent space for the actor and the reactor. ReMFlow, based on the DiT architecture, handles offline generation, while ARMFlow is the autoregressive online model, consisting of a DiT context encoder and an MLP velocity predictor. The BSCE strategy is applied progressively during online training to reduce accumulated error in autoregression.
  • Figure 3: Qualitative comparison with ReGenNet on the InterHuman dataset. Problematic interactions, including penetrations and semantic misalignment, are marked with red dashed lines, and correct ones are marked in green.