ECHO: Ego-Centric modeling of Human-Object interactions

Ilya A. Petrov, Vladimir Guzov, Riccardo Marin, Emre Aksan, Xu Chen, Daniel Cremers, Thabo Beeler, Gerard Pons-Moll

Abstract

Modeling human-object interactions (HOI) from an egocentric perspective is a critical yet challenging task, particularly when relying on sparse signals from wearable devices like smart glasses and watches. We present ECHO, the first unified framework to jointly recover human pose, object motion, and contact dynamics solely from head and wrist tracking. To tackle the underconstrained nature of this problem, we introduce a novel tri-variate diffusion process with independent noise schedules that models the mutual dependencies between the human, object, and interaction modalities. This formulation allows ECHO to operate with flexible input configurations, making it robust to intermittent tracking and capable of leveraging partial observations. Crucially, it enables training on a combination of large-scale human motion datasets and smaller HOI collections, learning strong priors while capturing interaction nuances. Furthermore, we employ a smooth inpainting inference mechanism that enables the generation of temporally consistent interactions for arbitrarily long sequences. Extensive evaluations demonstrate that ECHO achieves state-of-the-art performance, significantly outperforming existing methods lacking such flexibility.
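
One way to picture the independent noise schedules is that each modality gets its own diffusion timestep, and an observed modality is left clean. The following is a minimal sketch under that reading, not the authors' implementation: the cosine-style schedule, the choice of `t = 0` for observed modalities, and every name in it (`MODALITIES`, `alpha_bar`, `sample_timesteps`, `add_noise`) are assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Sequence

# Hypothetical modality names standing in for H, O, and I.
MODALITIES = ("human", "object", "interaction")


def alpha_bar(t: int, num_steps: int) -> float:
    """Cumulative signal level in [0, 1]; a cosine-style schedule (assumed)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def sample_timesteps(num_steps: int, observed: Sequence[str] = (), rng=random) -> Dict[str, int]:
    """Draw one timestep per modality, independently of the others.

    Observed modalities get t = 0, i.e. they are passed in without noise.
    """
    return {
        m: 0 if m in observed else rng.randint(1, num_steps)
        for m in MODALITIES
    }


def add_noise(x: List[float], t: int, num_steps: int, rng=random) -> List[float]:
    """Forward-diffuse a flat feature vector to timestep t."""
    a = alpha_bar(t, num_steps)
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for v in x
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Contacts are observed here, so only human and object are noised.
    steps = sample_timesteps(1000, observed=("interaction",), rng=rng)
    features = {m: [0.5, -0.5] for m in MODALITIES}
    noisy = {m: add_noise(features[m], steps[m], 1000, rng) for m in MODALITIES}
    print(steps)
    print(noisy)
```

Because each modality carries its own timestep, the same network input format covers both a dataset without object annotations (noise for the missing modalities) and inference with partial observations (clean values for the observed ones).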

Paper Structure

This paper contains 48 sections, 25 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: ECHO. Inferring complex interactions from sparse wearable signals is challenging. ECHO is the first method to jointly recover full-body Human-Object Interaction sequences (top) solely from sparse 3-point tracking. Our flexible framework supports various inference modes (bottom), leveraging partial or intermittent observations (shown in red) of human pose, object trajectory, or contact dynamics.
  • Figure 2: Representation. ECHO operates in a per-frame head-centric coordinate system.
  • Figure 3: ECHO overview. ECHO requires only head and hand tracking and an object class to predict the Human, Object, and Interaction modalities. Each input token is composed of the condition together with either the observed modality or noise for $\mathcal{H}$, $\mathcal{O}$, and $\mathcal{I}$. Every modality has its own denoising step. The model allows flexible input configurations: in the example above, contacts $\mathcal{I}$ serve as an additional input to the network, which infers the other modalities, $\mathcal{H}$ and $\mathcal{O}$, matching the extended condition.
  • Figure 4: Comparison of inference strategies. Standard per-window inference (left) ignores the context of past predictions. Inpainting (middle) uses past predictions as a condition but drops new predictions for the overlapping region. Our smooth inpainting (right) blends past and current predictions in the overlapping region at every diffusion step, ensuring seamless transitions.
  • Figure 5: Qualitative results of ECHO. Our method accurately reconstructs human-object interactions across diverse scenarios. In contrast, competing methods often fail to capture correct contact dynamics, leading to artifacts such as object penetration or floating. For dynamic visualizations, please refer to the supplementary video.
  • ...and 2 more figures
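
The smooth inpainting described in the Figure 4 caption can be sketched as a blend applied to the overlapping frames after each denoising step. This is only an illustration under stated assumptions: the linear weight ramp, the frame layout (a list of per-frame feature lists), and the names `blend_weights`, `smooth_blend`, `sample_window`, and `denoise_step` are hypothetical, and the re-noising of the past prediction to the current noise level is omitted for brevity.

```python
from typing import Callable, List

Frame = List[float]


def blend_weights(overlap: int) -> List[float]:
    """Weight of the past prediction per overlapping frame.

    A linear ramp (assumed): close to 1 at the start of the overlap,
    close to 0 at its end.
    """
    return [1.0 - (i + 1) / (overlap + 1) for i in range(overlap)]


def smooth_blend(past_tail: List[Frame], current: List[Frame], overlap: int) -> List[Frame]:
    """Blend the previous window's last `overlap` frames into the
    first `overlap` frames of the current window estimate."""
    if len(past_tail) != overlap or len(current) < overlap:
        raise ValueError("overlap does not match the window sizes")
    weights = blend_weights(overlap)
    blended = [
        [w * p + (1.0 - w) * c for p, c in zip(past_frame, cur_frame)]
        for w, past_frame, cur_frame in zip(weights, past_tail, current[:overlap])
    ]
    # Frames after the overlap are taken from the current window unchanged.
    return blended + [list(frame) for frame in current[overlap:]]


def sample_window(
    denoise_step: Callable[[List[Frame], int], List[Frame]],
    init: List[Frame],
    past_tail: List[Frame],
    overlap: int,
    num_steps: int,
) -> List[Frame]:
    """Run a reverse-diffusion loop, blending on every step."""
    x = init
    for step in reversed(range(num_steps)):
        x = denoise_step(x, step)                # hypothetical model call
        x = smooth_blend(past_tail, x, overlap)  # blend past and current
    return x


if __name__ == "__main__":
    past = [[1.0], [1.0], [1.0]]
    cur = [[0.0] for _ in range(5)]
    print(smooth_blend(past, cur, overlap=3))
```

In contrast to plain inpainting, which would overwrite the overlapping frames with the past prediction, the blend lets the current window's estimate contribute increasingly toward the end of the overlap.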