ZARRIO @ Ego4D Short Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-based models for STA
Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Josechu Guerrero-Campo, Giovanni Maria Farinella
TL;DR
The paper tackles Short-Term Object Interaction Anticipation (STA) in egocentric video, aiming to predict the next active object, its noun/verb labels, bounding box, and time-to-contact. It proposes STAformer, an attention-based architecture that fuses image and video inputs via frame-guided temporal pooling, dual image-video attention, and multi-scale fusion. Two grounding modules are introduced: environment affordances, derived from an activity-centric zone database to refine noun/verb predictions, and interaction hotspots, which reweight predictions based on likely interaction locations. On Ego4D benchmarks, STAformer achieves state-of-the-art results, including 33.5 $N$ mAP, 17.25 $N+V$ mAP, 11.77 $N+\delta$ mAP, and 6.75 Overall top-5 mAP on the test set, with ablations confirming the benefit of both affordances and hotspot guidance. These contributions advance STA by linking predictions to observable human behavior and environmental constraints for more reliable anticipation in assistive and robotics contexts.
Abstract
Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-video input pair. Moreover, we introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model, which acts as a persistent memory of the interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hand and object trajectories, increasing confidence in STA predictions localized around the hotspot. On the test set, our method obtains 33.5 N mAP, 17.25 N+V mAP, 11.77 N+δ mAP, and 6.75 Overall top-5 mAP when trained on the v2 training dataset.
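To illustrate the hotspot-based grounding described above, the following is a minimal sketch of how detection confidences could be re-weighted by a predicted interaction hotspot map. It is not the authors' implementation: the `Detection` class, the `hotspot_support` and `reweight` functions, the mixing weight `alpha`, and the multiplicative rule are all assumptions made for illustration. The hotspot is assumed to be a 2D grid of probabilities in [0, 1], and boxes are assumed to be given in integer grid coordinates.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2) in grid cells, x2 and y2 exclusive (assumed convention)
    score: float  # detector confidence in [0, 1]


def hotspot_support(box, hotspot):
    """Mean hotspot probability over the cells covered by the box."""
    x1, y1, x2, y2 = box
    values = [hotspot[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(values) / len(values) if values else 0.0


def reweight(detections, hotspot, alpha=0.5):
    """Scale each score by (1 - alpha) + alpha * support.

    alpha is a hypothetical mixing weight: alpha = 0 leaves the scores
    unchanged, alpha = 1 multiplies them by the hotspot support alone.
    """
    reweighted = []
    for det in detections:
        support = hotspot_support(det.box, hotspot)
        factor = (1.0 - alpha) + alpha * support
        reweighted.append(Detection(det.box, det.score * factor))
    # Highest re-weighted confidence first.
    return sorted(reweighted, key=lambda d: d.score, reverse=True)


# Toy 4x4 hotspot map: all interaction probability sits in the top-left 2x2 block.
hotspot = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
detections = [
    Detection(box=(2, 2, 4, 4), score=0.6),  # away from the hotspot
    Detection(box=(0, 0, 2, 2), score=0.5),  # on the hotspot
]
ranked = reweight(detections, hotspot, alpha=0.5)
```

In this toy example the detection on the hotspot keeps its score while the one outside it is halved, so the on-hotspot detection ranks first despite its lower initial confidence. Under this rule confidence increases only in relative terms; an additive boost would be an equally plausible design.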
