InterTrack: Tracking Human Object Interaction without Object Templates

Xianghui Xie, Jan Eric Lenssen, Gerard Pons-Moll

TL;DR

InterTrack tackles monocular human–object interaction tracking without object templates by decomposing 4D reconstruction into per-frame pose estimation and global shape optimization. It introduces CorrAE for temporally consistent human registration and TOPNet for temporally stable object rotations, followed by joint optimization constrained by contact. A synthetic video engine, ProciGen-V, provides large-scale training data that generalizes well to real videos, enabling strong performance on BEHAVE and InterCap while outperforming template-based and template-free baselines. The approach yields robust tracking under occlusion and dynamic motion, with practical impact for scalable HOI analysis using only RGB video and synthetic data.

Abstract

Tracking human-object interaction from videos is important for understanding human behavior from the rapidly growing stream of video data. Previous video-based methods require predefined object templates, while single-image-based methods are template-free but lack temporal consistency. In this paper, we present a method to track human-object interaction without any object shape templates. We decompose the 4D tracking problem into per-frame pose tracking and canonical shape optimization. We first apply a single-view reconstruction method to obtain temporally inconsistent per-frame interaction reconstructions. Then, for the human, we propose an efficient autoencoder that predicts SMPL vertices directly from the per-frame reconstructions, introducing temporally consistent correspondence. For the object, we introduce a pose estimator that leverages temporal information to predict smooth object rotations under occlusion. To train our model, we propose a method to generate synthetic interaction videos and synthesize a total of 10 hours of video across 8.5k sequences with full 3D ground truth. Experiments on BEHAVE and InterCap show that our method significantly outperforms previous template-based video tracking and single-frame reconstruction methods. Our proposed synthetic video dataset also allows training video-based methods that generalize to real-world videos. Our code and dataset will be publicly released.
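The decomposition into per-frame pose tracking and canonical shape optimization can be illustrated with a minimal sketch: one canonical point set is shared by all frames, and each frame only carries a rigid pose that maps it into that frame. Everything below is an illustrative assumption rather than the paper's implementation: the helper names `rot_z` and `pose_points`, the restriction to rotations about the z-axis, and the toy point and pose values are all hypothetical.

```python
import math


def rot_z(angle_rad):
    """3x3 rotation about the z-axis (an illustrative pose parameterization)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def pose_points(canonical_points, rotation, translation):
    """Apply x' = R x + t to every canonical point."""
    posed = []
    for p in canonical_points:
        posed.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return posed


# One canonical shape shared by all frames (hypothetical toy points).
canonical = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Per-frame poses as (rotation angle, translation); hypothetical values.
frame_poses = [
    (0.0, (0.0, 0.0, 0.0)),
    (math.pi / 2, (1.0, 0.0, 0.0)),
]

# The 4D track is the canonical shape placed by each frame's pose.
track = [pose_points(canonical, rot_z(a), t) for a, t in frame_poses]
```

Because the shape is stored once, any refinement of the canonical points changes every frame consistently, while temporal smoothness only needs to be enforced on the much smaller set of pose parameters.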

Paper Structure (26 sections, 5 equations, 13 figures, 6 tables)

This paper contains 26 sections, 5 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: From a monocular RGB video, our method tracks the human and object under occlusion and dynamic motions, without using any object templates. Our method is trained only on synthetic data and generalizes well to real-world videos captured by mobile phones.
  • Figure 2: Method overview. Given an image sequence of human-object interaction and HDM [xie2023template_free] reconstructions (A), we aim to obtain coherent tracking of the human and object across frames (E). We first use a simple yet efficient autoencoder, CorrAE, to obtain coherent human points and optimize the human via the SMPL layer (B, \ref{subsec:human-recon}). We then use a temporal object pose estimator, TOPNet, to predict the object rotation, which allows us to optimize a common object shape in canonical space and fine-tune the pose predictions (C, \ref{subsec:object-recon}). Finally, we jointly optimize the human and object based on contacts to obtain consistent tracking (D, \ref{subsec:joint-track}).
  • Figure 3: Comparing our method against VisTracker [xie2023vistracker] and HDM [xie2023template_free] on BEHAVE [bhatnagar22behave] (rows 1-2) and InterCap [huang2022intercap] (rows 3-4). VisTracker relies on post-hoc processing to refine the object pose, which is inaccurate, and HDM reconstructs inconsistent object shapes (rows 1-2) or interactions (rows 3-4). Our temporal pose estimation and optimization lead to consistent shape and interaction.
  • Figure 4: The effects of the 2D and 3D losses for object optimization. Without the 2D mask loss, the object shape is very noisy, and without the 3D Chamfer loss, the relative object position is incorrect.
  • Figure 5: Ablating the influence of the contact-based refinement. Without contact, the hand and object can be far apart, leading to implausible interaction.
  • ...and 8 more figures
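
The 3D Chamfer loss named in the Figure 4 caption compares two point sets by nearest-neighbor distance. The sketch below uses one common definition (symmetric, mean of squared nearest-neighbor distances); the captions do not state which variant or weighting the paper uses, so the function name `chamfer_distance` and this exact formulation are assumptions for illustration only.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Uses the mean squared nearest-neighbor distance in each direction.
    This is one common convention, assumed here for illustration.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        # For every source point, find its closest destination point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Hypothetical toy point sets: the second point of `observed` is offset in z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

loss = chamfer_distance(predicted, observed)
```

This brute-force version costs O(|A| * |B|) comparisons, which is fine for a small illustration; for dense point clouds a spatial index or a GPU implementation would normally replace it.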