ZARRIO @ Ego4D Short Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-based models for STA

Lorenzo Mur-Labadia; Ruben Martinez-Cantin; Josechu Guerrero-Campo; Giovanni Maria Farinella

TL;DR

The paper tackles Short-Term Object Interaction Anticipation (STA) in egocentric video, aiming to predict the next active object, its noun/verb labels, bounding box, and time-to-contact. It proposes STAformer, an attention-based architecture that fuses image and video inputs via frame-guided temporal pooling, dual image-video attention, and multi-scale fusion. Two grounding modules are introduced: environment affordances, derived from an activity-centric zone database to refine noun/verb predictions, and interaction hotspots, which reweight predictions based on likely interaction locations. On Ego4D benchmarks, STAformer achieves state-of-the-art results, including 33.5 $N$ mAP, 17.25 $N+V$ mAP, 11.77 $N+\delta$ mAP, and 6.75 Overall top-5 mAP on the test set, with ablations confirming the benefit of both affordances and hotspot guidance. These contributions advance STA by linking predictions to observable human behavior and environmental constraints for more reliable anticipation in assistive and robotics contexts.

Abstract

Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-video input pair. Moreover, we introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. On the test set, our model obtains 33.5 N mAP, 17.25 N+V mAP, 11.77 N+δ mAP and 6.75 Overall top-5 mAP when trained on the v2 training dataset.
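To make the hotspot idea concrete, the following is a minimal sketch of how a confidence score could be re-weighted by an interaction hotspot map. It is an illustration under stated assumptions, not the authors' implementation: the function name `reweight_by_hotspot`, the `floor` parameter, the choice to sample the map at the box center, and the linear scaling rule are all hypothetical.

```python
import numpy as np

def reweight_by_hotspot(boxes, scores, hotspot, floor=0.5):
    """Scale detection scores by the hotspot probability at each box center.

    boxes:   (N, 4) array of [x1, y1, x2, y2] in pixel coordinates.
    scores:  (N,) array of detection confidence scores.
    hotspot: (H, W) interaction hotspot probability map with values in [0, 1].
    floor:   hypothetical lower bound on the multiplier, so boxes far from
             the hotspot are down-weighted rather than discarded.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    h, w = hotspot.shape

    # Box centers, rounded and clipped so they always index inside the map.
    cx = np.clip(np.round((boxes[:, 0] + boxes[:, 2]) / 2).astype(int), 0, w - 1)
    cy = np.clip(np.round((boxes[:, 1] + boxes[:, 3]) / 2).astype(int), 0, h - 1)

    # Assumed rule: multiplier grows linearly from `floor` to 1 with the
    # hotspot probability sampled at the box center.
    p = hotspot[cy, cx]
    return scores * (floor + (1.0 - floor) * p)

# Toy example: one box sits on the hotspot, the other does not.
hotspot = np.zeros((10, 10))
hotspot[2, 2] = 1.0
boxes = [[1, 1, 3, 3], [6, 6, 8, 8]]
new_scores = reweight_by_hotspot(boxes, [0.8, 0.8], hotspot)
```

In this toy case the box centered on the hotspot keeps its score while the other is halved, which mirrors the stated goal of increasing relative confidence in predictions localized around the hotspot.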

Paper Structure (7 sections, 2 figures, 3 tables)

Introduction
Methods
STAformer Architecture
Leveraging environment affordances.
Leveraging interaction hotspots.
Results
Conclusions

Figures (2)

Figure 1: (a-b) The image-video pair input is processed by the proposed STAformer model, which predicts object bounding boxes, the associated verb/noun probabilities, time-to-contact and confidence scores. (c) Environment affordances are inferred from video and used to refine the predicted noun/verb probabilities. (d) Our model observes detected hand-object interactions in the video and predicts an interaction hotspot probability map, which is used to re-weight confidence scores based on box locations, leading to (e) our final predictions.
Figure 2: Qualitative results on Ego4D.
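
Step (c) of Figure 1 can be illustrated with a small sketch of affordance-based refinement. This is an assumption-laden example rather than the method as published: the retrieval by cosine similarity, the similarity-weighted count aggregation, the linear mixing weight `alpha`, and the names `affordance_prior` and `refine_probs` are hypothetical choices made for illustration.

```python
import numpy as np

def affordance_prior(query, zone_feats, zone_counts, k=2):
    """Build a class prior from the k most similar zones in an affordance database.

    query:       (D,) visual descriptor of the current video (hypothetical).
    zone_feats:  (Z, D) descriptors of stored activity-centric zones.
    zone_counts: (Z, K) per-zone counts of observed noun (or verb) classes.
    """
    q = query / np.linalg.norm(query)
    z = zone_feats / np.linalg.norm(zone_feats, axis=1, keepdims=True)
    sims = z @ q                          # cosine similarity to every zone
    top = np.argsort(-sims)[:k]           # indices of the k nearest zones
    weights = np.clip(sims[top], 0.0, None)
    counts = (weights[:, None] * zone_counts[top]).sum(axis=0)
    total = counts.sum()
    if total == 0:                        # no evidence: fall back to uniform
        return np.full(zone_counts.shape[1], 1.0 / zone_counts.shape[1])
    return counts / total

def refine_probs(probs, prior, alpha=0.5):
    """Assumed refinement rule: linear mix of model output and affordance prior."""
    mixed = (1.0 - alpha) * np.asarray(probs, dtype=float) + alpha * prior
    return mixed / mixed.sum()

# Toy database with two zones and three classes.
zone_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
zone_counts = np.array([[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]])
prior = affordance_prior(np.array([1.0, 0.0]), zone_feats, zone_counts, k=1)
refined = refine_probs([0.2, 0.5, 0.3], prior, alpha=0.5)
```

Here the retrieved zone only ever contained class 0, so the refined distribution shifts toward that class even though the model initially preferred class 1.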
