Bidirectional Progressive Transformer for Interaction Intention Anticipation

Zichen Zhang, Hongchen Luo, Wei Zhai, Yang Cao, Yu Kang

TL;DR

This work tackles joint interaction-intention anticipation in egocentric videos by predicting both future hand trajectories $\mathcal{H}$ and interaction hotspots $\mathcal{O}_I$ with a Bidirectional prOgressive Transformer (BOT). The method leverages a Spatial-Temporal Reconstruction Module to suppress view-change conflicts, dual independent branches for trajectories and hotspots, and a Bi-Progressive Enhancement module with cross-attention for mutual, time-step refinements, complemented by a Trajectory Stochastic Unit and a CVAE to inject realistic uncertainty. It achieves state-of-the-art results on three benchmarks—Epic-Kitchens-100, EGO4D, and EGTEA Gaze+—demonstrating strong accuracy in both trajectory forecasting (ADE/FDE) and hotspot localization (SIM, AUC-J, NSS) and showing robust performance under sampling-based inference. The approach emphasizes the intrinsic coupling between hand motion and contact regions, enabling continuous correction and uncertainty-aware predictions that are well-suited for downstream embodied AI tasks.
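The trajectory metrics named above, ADE and FDE, are standard displacement errors: ADE averages the Euclidean distance between predicted and ground-truth points over all future time steps, and FDE takes that distance at the final step only. The following is a minimal sketch of both. The function name `displacement_errors` and the representation of a trajectory as a list of `(x, y)` tuples are assumptions made for illustration; the coordinate normalization and sampling protocol used in the paper's evaluation are not specified here.

```python
import math


def displacement_errors(pred, gt):
    """Return (ADE, FDE) for two equal-length trajectories.

    pred, gt: lists of (x, y) points, one per future time step
    (hypothetical representation, chosen for this sketch).
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    # Per-step Euclidean distance between prediction and ground truth.
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    ade = sum(dists) / len(dists)  # average over all time steps
    fde = dists[-1]                # error at the final time step
    return ade, fde


pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(displacement_errors(pred, gt))  # (1.0, 2.0)
```

In the example the per-step errors are 0, 1, and 2, so the error grows over time: FDE (2.0) exceeds ADE (1.0), which is the typical signature of accumulated prediction error.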

Abstract

Interaction intention anticipation aims to jointly predict future hand trajectories and interaction hotspots. Existing research often treats trajectory forecasting and interaction hotspot prediction as separate tasks, or considers only the impact of trajectories on interaction hotspots, which leads to the accumulation of prediction errors over time. However, a deeper inherent connection exists between hand trajectories and interaction hotspots, which allows for continuous mutual correction between them. Building upon this relationship, we establish a novel Bidirectional prOgressive Transformer (BOT), which introduces a bidirectional progressive mechanism into interaction intention anticipation. First, BOT makes full use of the spatial information in the last observation frame through the Spatial-Temporal Reconstruction Module, mitigating conflicts arising from view changes in first-person videos. Then, on top of two independent prediction branches, a Bidirectional Progressive Enhancement Module mutually improves the predictions of hand trajectories and interaction hotspots over time to minimize error accumulation. Finally, acknowledging the intrinsic randomness of natural human behavior, we employ a Trajectory Stochastic Unit and a C-VAE to introduce appropriate uncertainty into trajectories and interaction hotspots, respectively. Our method achieves state-of-the-art results on three benchmark datasets, Epic-Kitchens-100, EGO4D, and EGTEA Gaze+, demonstrating superior performance in complex scenarios.
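The mutual correction between the two branches can be illustrated with a generic bidirectional cross-attention step, in which trajectory features query hotspot features and vice versa. This is only a sketch under stated assumptions, not the paper's Bidirectional Progressive Enhancement Module: the names `cross_attend` and `bidirectional_step` are hypothetical, features are plain lists of floats, there are no learned projections, and both branches are updated symmetrically from the pre-update features with a residual connection.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention without learned projections.

    Each query row is replaced by an attention-weighted mean of the
    context rows (the context serves as both keys and values).
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)])
    return out


def bidirectional_step(traj, hot):
    """One mutual refinement step (assumed form): each branch attends to
    the other's current features and adds the result as a residual."""
    traj_update = cross_attend(traj, hot)
    hot_update = cross_attend(hot, traj)
    new_traj = [[t + u for t, u in zip(r, up)] for r, up in zip(traj, traj_update)]
    new_hot = [[h + u for h, u in zip(r, up)] for r, up in zip(hot, hot_update)]
    return new_traj, new_hot


# Toy features: one trajectory token and one hotspot token of width 2.
traj, hot = bidirectional_step([[1.0, 0.0]], [[0.0, 2.0]])
print(traj, hot)  # [[1.0, 2.0]] [[1.0, 2.0]]
```

Applying such a step once per predicted time step is one way to read "progressive" refinement: each branch is corrected by the other before the next step is predicted, rather than only after the full sequence has been produced.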

Paper Structure (12 sections, 11 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Error accumulation when only the influence of hand trajectories on interaction hotspots is considered. Red denotes the ground truth; blue denotes the predictions.
  • Figure 2: Inherent connection between different hand trajectory categories and the distribution of their corresponding contact points. The analysis is conducted on Epic-Kitchens-100 [Damen2018EPICKITCHENS].
  • Figure 3: Overview of the Bidirectional Progressive Transformer. It follows a dual-branch structure with a Bi-Progressive Enhancement Module between them to anticipate future hand trajectories and interaction hotspots.
  • Figure 4: Bi-Progressive Enhancement Module. The output feature map is enclosed in purple boxes.
  • Figure 5: Uncertain region of the hand trajectory (the purple region), derived from the intersection of two ellipses corresponding to Eqs. \ref{eq13}-\ref{eq14}.
  • ...and 6 more figures