Interaction Region Visual Transformer for Egocentric Action Anticipation

Debaditya Roy, Ramanathan Rajendiran, Basura Fernando

TL;DR

This work addresses egocentric action anticipation by modeling visual changes in hands and interacted objects to refine video representations. It introduces InAViT, a spatio-temporal transformer that builds interaction tokens via three interaction-region modeling schemes (Spatial Cross-Attention, SCA; Self-attention Over Time, SOT; and UB), then contextually refines these tokens using Trajectory Cross-Attention to fuse them with scene context, forming an interaction-centric video representation that is processed by MotionFormer. The approach yields state-of-the-art performance on EK100 and EGTEA Gaze+, including a 3.3% improvement in mean-top5 recall over the second-best model on EK100, and remains robust for longer anticipation windows. The findings highlight the value of explicitly capturing hand–object appearance changes and environment context for predicting forthcoming actions.

Abstract

Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. We propose a novel transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods, including object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean-top5 recall.

Paper Structure (23 sections, 9 equations, 8 figures, 10 tables)

This paper contains 23 sections, 9 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Block diagram of our action anticipation model based on interaction region modeling (InAViT). Ellipses represent processing mechanisms. Trajectory attention $\times12$ refers to MotionFormer.
  • Figure 2: Modeling interaction region tokens using Spatial Cross-Attention. In every frame, hand tokens act as queries and object tokens as keys and values to compute refined hand tokens. Refined object tokens are computed with the object token as query and the hand and other object tokens as keys and values (not shown here to avoid clutter). Interaction tokens consist of the refined hand and object tokens.
  • Figure 3: Modeling interaction tokens using Self-attention Over Time. We compute self-attention over hand tokens in all the frames to obtain refined hand tokens. We repeat this for every object region across frames. The refined hand and object tokens at every frame are the interaction tokens.
  • Figure 4: Context infusion into interaction region tokens using Trajectory Cross-Attention. We compute spatial cross-attention (SCA) to find the best location for the interaction trajectory by comparing the interaction query to the context keys. Next, we pool the interaction trajectories across time to form connections across the interaction tokens in a frame.
  • Figure 5: InAViT attends to the location(s) where the next action will occur: (a) onion and (b) cup and sugar.
  • ...and 3 more figures
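
The captions of Figures 2 and 3 describe two ways of building interaction tokens, and a small sketch may make the query/key/value roles easier to follow. The NumPy code below is only an illustration under stated assumptions, not the authors' implementation: it uses single-head scaled dot-product attention without learned projection matrices, and the function names (`attention`, `spatial_cross_attention`, `self_attention_over_time`) and the tensor layout `(frames, regions, dim)` are hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention.

    q: (..., Nq, D); k and v: (..., Nk, D). Returns (..., Nq, D).
    Assumption: no learned projections, single head.
    """
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def spatial_cross_attention(hands, objects):
    """SCA sketch following the Figure 2 caption.

    hands: (T, H, D) hand tokens; objects: (T, O, D) object tokens.
    Returns interaction tokens of shape (T, H + O, D).
    """
    # Per frame: hand tokens are queries, object tokens are keys and values.
    refined_hands = attention(hands, objects, objects)

    # Per frame: each object token is a query; the hand tokens and the
    # *other* object tokens are keys and values.
    num_objects = objects.shape[1]
    refined_objects = np.empty_like(objects)
    for i in range(num_objects):
        others = np.delete(objects, i, axis=1)           # (T, O-1, D)
        kv = np.concatenate([hands, others], axis=1)     # (T, H+O-1, D)
        refined_objects[:, i:i + 1] = attention(objects[:, i:i + 1], kv, kv)

    return np.concatenate([refined_hands, refined_objects], axis=1)


def self_attention_over_time(hands, objects):
    """SOT sketch following the Figure 3 caption.

    Each region (hand or object) attends only to its own tokens
    across all T frames. Returns (T, H + O, D).
    """
    regions = np.concatenate([hands, objects], axis=1)   # (T, R, D)
    per_region = np.swapaxes(regions, 0, 1)              # (R, T, D)
    refined = attention(per_region, per_region, per_region)
    return np.swapaxes(refined, 0, 1)                    # (T, R, D)
```

In this sketch, SCA mixes information across regions within a frame, while SOT mixes information across frames within a region; both return one refined token per region per frame, which is what the captions call interaction tokens. A real model would add learned query/key/value projections and multiple heads.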