Reconstructing Objects along Hand Interaction Timelines in Egocentric Video

Zhifan Zhu, Siddhant Bansal, Shashank Tripathi, Dima Damen

TL;DR

This work defines ROHIT, a task to reconstruct 3D object poses along complete Hand Interaction Timelines in egocentric video, and introduces the Constrained Optimisation and Propagation (COP) framework to optimise and propagate object poses across Static, Unstable Contact, and Stable Grasp segments. Using two new datasets, HOT3D-HIT (with 3D ground truth) and EPIC-HIT (in the wild), COP achieves consistent improvements in stable grasp and HIT reconstruction without requiring full 3D supervision. The approach balances segment-specific constraints (E_mask, E_SG, E_push, E_pull) and temporal propagation to enable robust, timeline-aware hand-object reconstruction, with strong results on both controlled and in-the-wild data and a clear path toward CAD-agnostic future work.

Abstract

We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again w.r.t. the scene. We model these pose constraints over the HIT, and propose to propagate the object's pose along the HIT, enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps - i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.

Paper Structure

This paper contains 26 sections, 9 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Sample HIT sequence from HOT3D [banerjee2025hot3d] with reconstruction results by our method. We illustrate the three types of temporal segments in hand-object interactions: Static, where the object is static relative to the scene; Unstable Contact, where the hand is firming its grip on the object; and Stable Grasp, where the hand is securely holding the object, until the object is Static again when put down. The plot illustrates the IoU of the in-contact vertices across neighbouring frames; for the formal definition, refer to the timeline definition section (sec:hoi_timeline_definition).
  • Figure 2: Qualitative results. Given a Hand Interaction Timeline (HIT) with an object in Static, Unstable Contact and Stable Grasp interaction segments, our method, COP, reconstructs hand (blue) and object (yellow) meshes along the HIT. We show input frames (left), projected meshes (middle) and meshes in the 3D world coordinate system (right). Rows 1-2 are from EPIC-HIT and row 3 is from HOT3D-HIT.
  • Figure 3: Stable Grasp Intuition. Three samples from HOT3D. In each row, we align the hand coordinate system for three frames from one stable grasp. Left: finger articulations and object pose vary over time. Right: contact area (shown as a heat map of object vertices in contact with the hand) remains consistent.
  • Figure 4: Optimising the Stable Grasp segment. We show three frames within one stable grasp. We utilise HaMeR [pavlakos2024reconstructing] to reconstruct the hand mesh in the wild. We initialise every per-frame $T_{o2h}^{n}$ to a single shared $T_{o2h}$, but keep the diverse finger articulations from the hand pose estimates. During optimisation, we measure the distance between each hand vertex and the object vertices $V_o$. In the left plot, we show that the contact area (where $d_{oh} \approx 0$) differs over time (visualised on the gray bottle). The novel loss, $E_{SG}$, minimises the variation of the distance between the hand and object vertices over time, by adjusting the object's pose relative to the hand. As $E_{SG}$ is minimised, the contact area is aligned (see updated plot). Additional losses are used to regularise the optimisation: $E_{mask}$ renders the reconstruction and compares it against estimated object masks, while $E_{push}$ and $E_{pull}$ respectively ensure that the hand does not penetrate the object and that the object does not drift away from the hand.
  • Figure 5: Sample propagation (e.g. from Static to Stable Grasp).
  • ...and 7 more figures
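
The captions of Figures 1 and 4 describe two quantities that a small numerical sketch may make more concrete: the IoU of in-contact vertices between neighbouring frames, and a stable-grasp term that penalises how much hand-object distances vary over time. The NumPy sketch below is illustrative only and is not the paper's implementation. The function names, the 5 mm contact threshold, the nearest-vertex distance, the choice to mark contact on hand vertices, and the use of temporal variance as the measure of "variation" are all assumptions made here.

```python
import numpy as np


def hand_object_distances(hand_verts, obj_verts):
    """Distance from each hand vertex to its nearest object vertex.

    hand_verts: (H, 3) array, obj_verts: (O, 3) array, same coordinate frame.
    Returns an (H,) array. Nearest-vertex distance is an assumption.
    """
    diff = hand_verts[:, None, :] - obj_verts[None, :, :]  # (H, O, 3)
    return np.linalg.norm(diff, axis=-1).min(axis=1)


def contact_iou(d_a, d_b, thresh=0.005):
    """IoU of the in-contact vertex sets of two frames.

    A vertex counts as in contact when its distance is below `thresh`
    (hypothetical value, in metres). Returns 0.0 if neither frame has contact.
    """
    a, b = d_a < thresh, d_b < thresh
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def stable_grasp_energy(dists):
    """Assumed stand-in for an E_SG-style term.

    dists: (N, H) array of per-frame hand-object distances.
    Returns the temporal variance per vertex, averaged over vertices.
    """
    return float(np.var(dists, axis=0).mean())


if __name__ == "__main__":
    hand = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    obj_frame_1 = np.array([[0.0, 0.0, 0.001], [1.0, 0.0, 0.5]])
    obj_frame_2 = np.array([[0.0, 0.0, 0.002], [1.0, 0.0, 0.003]])
    d1 = hand_object_distances(hand, obj_frame_1)
    d2 = hand_object_distances(hand, obj_frame_2)
    print("contact IoU:", contact_iou(d1, d2))
    print("energy:", stable_grasp_energy(np.stack([d1, d2])))
```

Under these assumptions, a low IoU between neighbouring frames would indicate a changing contact area, as in an Unstable Contact segment, while an energy close to zero would indicate the consistent contact that characterises a Stable Grasp.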