End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction

Haoyu Zhang, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha

Abstract

Monocular 4D human-object interaction (HOI) reconstruction - recovering a moving human and a manipulated object from a single RGB video - remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, which leads to high inference latency that fails to meet real-time requirements and makes them susceptible to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a feed-forward fashion from the given video and 3D template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical coherence. Extensive experiments demonstrate that THO runs at 31.5 FPS on a single RTX 4090 GPU, a >600x speedup over prior optimization-based methods, while simultaneously improving reconstruction accuracy and temporal consistency. The project page is available at: https://nianheng.github.io/THO-project/
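
To put the reported throughput in per-frame terms, the short sketch below converts the two figures quoted in the abstract (31.5 FPS and a >600x speedup) into latencies. It assumes the speedup is a ratio of throughputs measured under comparable conditions, which the abstract does not state explicitly. The function and variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract.
THO_FPS = 31.5        # reported throughput on a single RTX 4090 GPU
MIN_SPEEDUP = 600.0   # the abstract states ">600x", so this is a lower bound


def per_frame_latency_ms(fps: float) -> float:
    """Convert throughput (frames per second) to latency (milliseconds per frame)."""
    return 1000.0 / fps


def implied_baseline_fps(fps: float, speedup: float) -> float:
    """Throughput of a baseline that is `speedup` times slower.

    Assumption: speedup is defined as a ratio of throughputs.
    """
    return fps / speedup


tho_latency_ms = per_frame_latency_ms(THO_FPS)
# Because the speedup is a lower bound, the baseline throughput is an upper bound
# and the baseline per-frame latency is a lower bound.
baseline_fps_upper = implied_baseline_fps(THO_FPS, MIN_SPEEDUP)
baseline_latency_s_lower = 1.0 / baseline_fps_upper

print(f"THO latency: {tho_latency_ms:.1f} ms per frame")
print(f"Implied baseline throughput: below {baseline_fps_upper:.4f} FPS")
print(f"Implied baseline latency: above {baseline_latency_s_lower:.1f} s per frame")
```

Under that assumption, THO spends roughly 32 ms on each frame, while the optimization-based baselines would need more than about 19 s per frame.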

Paper Structure (34 sections, 10 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Overview. THO enables real-time 4D HOI reconstruction (31.5 FPS) by introducing a Spatial-Temporal Transformer to capture contact and interaction dynamics.
  • Figure 2: Overall pipeline of THO. From a monocular video, GVHMR extracts human priors and Interaction-Centric Crops ($\mathcal{I}^{crop}$). The 3D Vertex Encoder generates unified embeddings ($F^J, F^H, F^O$). Spatially, SCAT recovers occluded $F^O$ and injects $F^H$ via contact-aware cues. Fusing refined $f^O$ with joint and global contexts yields an aggregated motion token ($f^\text{token}$). Temporally, TIAT processes this token sequence to model temporal dynamics ($\hat{f}^\text{token}$), which MLPs decode into 4D HOI reconstruction.
  • Figure 3: Qualitative comparison on BEHAVE (Bhatnagar et al., 2022). THO mitigates inaccurate symmetric object pose predictions (Rows 1-2), yields more reasonable poses under severe occlusion (Row 3), and ensures better human-object contact (Row 4).
  • Figure 4: Efficiency vs. Performance on BEHAVE. THO achieves real-time inference (31.5 FPS) and outperforms all baselines in inference speed, while also yielding the best temporal smoothness ($\text{Acc}_h, \text{Acc}_o$), shown in (a), and the highest reconstruction accuracy ($\text{CD}_c$), shown in (b).
  • Figure 5: Impact of SCAT. OR recovers complete object geometry from occluded views, while HI corrects the relative pose to ensure physically consistent human-object contact, eliminating the misalignment observed in the baseline.
  • ...and 2 more figures