Reconstructing Objects along Hand Interaction Timelines in Egocentric Video
Zhifan Zhu, Siddhant Bansal, Shashank Tripathi, Dima Damen
TL;DR
This work defines ROHIT, the task of reconstructing 3D object poses along complete Hand Interaction Timelines (HITs) in egocentric video, and introduces the Constrained Optimisation and Propagation (COP) framework, which optimises and propagates object poses across Static, Unstable, and Stable Grasp segments. On two new datasets, HOT3D-HIT (with 3D ground truth) and EPIC-HIT (in the wild), COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5%, with evaluation based on 2D projection error where 3D ground truth is unavailable. The approach combines segment-specific constraints (E_mask, E_SG, E_push, E_pull) with temporal propagation for timeline-aware hand-object reconstruction on both controlled and in-the-wild data, and leaves CAD-agnostic reconstruction as future work.
Abstract
We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again w.r.t. the scene. We model these pose constraints over the HIT, and propose to propagate the object's pose along the HIT, enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps, i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.
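The pose constraints described in the abstract can be illustrated with a minimal sketch. It is not the COP framework itself, which optimises energy terms. It only shows the two rigid constraints a HIT implies: in a static segment the object pose is constant in the scene, and in a stable grasp the object pose is constant relative to the hand. The sketch assumes 4x4 homogeneous transforms in a shared world frame and known per-frame hand poses. The function names (`make_pose`, `propagate_static`, `propagate_stable_grasp`) are hypothetical and chosen for illustration.

```python
import numpy as np


def make_pose(angle_z, translation):
    """Build a 4x4 homogeneous transform: rotation about z, then translation."""
    c, s = np.cos(angle_z), np.sin(angle_z)
    pose = np.eye(4)
    pose[:2, :2] = [[c, -s], [s, c]]
    pose[:3, 3] = translation
    return pose


def propagate_static(world_from_obj, num_frames):
    """Static segment: the object does not move in the scene,
    so every frame shares the same world-frame pose."""
    return np.repeat(world_from_obj[None], num_frames, axis=0)


def propagate_stable_grasp(world_from_hand, anchor_idx, world_from_obj_anchor):
    """Stable-grasp segment: the object is assumed rigidly attached to the hand.

    The hand-relative object pose is computed once at the anchor frame and
    reused for every frame of the segment.
    """
    hand_from_obj = np.linalg.inv(world_from_hand[anchor_idx]) @ world_from_obj_anchor
    return np.stack([hand_pose @ hand_from_obj for hand_pose in world_from_hand])
```

Given an object pose estimated at one frame of a stable grasp, `propagate_stable_grasp` returns a pose for every frame of that segment. Unstable-grasp frames, where the object moves relative to the hand, are not covered by either helper and would need per-frame optimisation.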
