InterTrack: Tracking Human Object Interaction without Object Templates
Xianghui Xie, Jan Eric Lenssen, Gerard Pons-Moll
TL;DR
InterTrack tracks human–object interaction (HOI) from monocular video without object templates by decomposing 4D reconstruction into per-frame pose estimation and global shape optimization. It introduces CorrAE for temporally consistent human registration and TOPNet for temporally stable object rotations, followed by a joint optimization constrained by contact. A synthetic video engine, ProciGen-V, provides large-scale training data that generalizes to real videos. On BEHAVE and InterCap, the method outperforms both template-based and template-free baselines. It remains robust under occlusion and dynamic motion, which makes HOI analysis scalable using only RGB video and synthetic training data.
Abstract
Tracking human object interaction from videos is important for understanding human behavior from the rapidly growing stream of video data. Previous video-based methods require predefined object templates, while single-image-based methods are template-free but lack temporal consistency. In this paper, we present a method to track human object interaction without any object shape templates. We decompose the 4D tracking problem into per-frame pose tracking and canonical shape optimization. We first apply a single-view reconstruction method to obtain temporally inconsistent per-frame interaction reconstructions. Then, for the human, we propose an efficient autoencoder that predicts SMPL vertices directly from the per-frame reconstructions, introducing temporally consistent correspondence. For the object, we introduce a pose estimator that leverages temporal information to predict smooth object rotations under occlusions. To train our model, we propose a method to generate synthetic interaction videos and synthesize 8.5k sequences, 10 hours of video in total, with full 3D ground truth. Experiments on BEHAVE and InterCap show that our method significantly outperforms previous template-based video tracking and single-frame reconstruction methods. Our proposed synthetic video dataset also allows training video-based methods that generalize to real-world videos. Our code and dataset will be publicly released.
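
The decomposition into per-frame poses and one canonical shape can be illustrated with a toy sketch. This is not the paper's pipeline: it assumes the per-frame rigid poses are already known and that points are in correspondence across frames, and every name in it (`rot_z`, `apply_pose`, `to_canonical`, `fuse_canonical`) is hypothetical. It only shows why undoing each frame's pose lets per-frame observations be fused into a single shape.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z-axis (stand-in for a per-frame object rotation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(points, R, t):
    """Map canonical points into a frame: x = R p + t."""
    return [tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
            for p in points]

def to_canonical(points, R, t):
    """Undo a frame's pose: p = R^T (x - t)."""
    return [tuple(sum(R[j][i] * (x[j] - t[j]) for j in range(3)) for i in range(3))
            for x in points]

def fuse_canonical(frames, poses):
    """Average corresponding points over all frames after undoing each pose."""
    canon = [to_canonical(pts, R, t) for pts, (R, t) in zip(frames, poses)]
    n = len(canon)
    return [tuple(sum(f[k][i] for f in canon) / n for i in range(3))
            for k in range(len(canon[0]))]

# Toy canonical shape (assumed, for illustration only).
shape = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Four frames, each with its own pose and a small per-frame shape error
# along x. The errors sum to zero, so averaging in canonical space removes them.
poses = [(rot_z(0.3 * k), (0.1 * k, 0.0, 0.5)) for k in range(4)]
errors = [0.02, -0.02, 0.01, -0.01]

frames = []
for (R, t), eps in zip(poses, errors):
    noisy = [(p[0] + eps, p[1], p[2]) for p in shape]
    frames.append(apply_pose(noisy, R, t))

fused = fuse_canonical(frames, poses)
```

Each individual frame is off by its own error, but `fused` matches `shape` up to floating-point rounding, because all frames are compared in the same canonical coordinate system. In the method described above, the poses and correspondences are predicted rather than given.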
