
AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Josechu Guerrero, Giovanni Maria Farinella, Antonino Furnari

TL;DR

The paper tackles Short-Term Object Interaction Anticipation (STA) from egocentric video and introduces STAformer, an image-video transformer that uses frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to predict next-active objects, noun/verb categories, time-to-contact, and confidence. It grounds predictions through two affordance modules: environment affordances, via a zone memory that links functional zones to plausible interactions, and interaction hotspots, which forecast where future interactions are likely to occur and are used to re-weight confidence scores. The environment affordance probabilities $p_{\text{aff}}(n|\mathcal{V}')$ and $p_{\text{aff}}(v|\mathcal{V}')$ are fused with the STA predictions $p_{\text{sta}}(n|\mathcal{V}', I')$ and $p_{\text{sta}}(v|\mathcal{V}', I')$ via unnormalized joint likelihoods, $p_{\text{fus}}(n|I', \mathcal{V}') \propto p_{\text{aff}}(n|\mathcal{V}') \, p_{\text{sta}}(n|\mathcal{V}', I')$, and similarly for verbs. The approach yields substantial relative improvements on Ego4D (up to +45.0% All mAP on v1, +42.1% on v2) and EPIC-Kitchens (up to +42% All mAP), and the authors provide open-source code, affordance databases, and new EPIC-Kitchens STA annotations to accelerate research.
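
As an illustration of the fusion rule stated above, the following is a minimal sketch rather than the authors' implementation. It multiplies an affordance distribution with an STA distribution class by class. The function name `fuse_probabilities`, the `normalize` flag, and the fallback to the STA distribution when the product is all zeros are assumptions made for this example; the summary only states that the fused likelihood is proportional to the product.

```python
def fuse_probabilities(p_aff, p_sta, normalize=True):
    """Fuse two per-class distributions by element-wise product.

    p_aff: affordance probabilities, one per class (noun or verb).
    p_sta: STA probabilities for the same classes, in the same order.
    normalize: hypothetical flag; if True the product is rescaled to sum to 1.
    """
    if len(p_aff) != len(p_sta):
        raise ValueError("p_aff and p_sta must cover the same classes")

    # Unnormalized joint likelihood: p_fus(c) is proportional to p_aff(c) * p_sta(c).
    joint = [a * s for a, s in zip(p_aff, p_sta)]
    if not normalize:
        return joint

    total = sum(joint)
    if total == 0.0:
        # Assumption: if the two distributions never agree, keep the STA prediction.
        return list(p_sta)
    return [j / total for j in joint]


# Toy example with three noun classes.
fused = fuse_probabilities([0.5, 0.25, 0.25], [0.2, 0.6, 0.2])
print(fused)
```

Because the product only rescales scores, classes that the affordance model considers implausible in the current scene are suppressed even when the STA head scores them highly.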

Abstract

Short-Term object-interaction Anticipation consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants or human-robot interaction to understand the user goals, but there is still room for improvement to perform STA in a precise and reliable way. In this work, we improve the performance of STA predictions with two contributions: 1. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-input video pair. 2. We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant relative Overall Top-5 mAP improvements of up to +45% on Ego4D and +42% on a novel set of curated EPIC-Kitchens STA labels. We will release the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.

Paper Structure (13 sections, 4 equations, 6 figures, 6 tables)

This paper contains 13 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: (a) Our approach takes as input an image-video pair. (b) The input is processed by the proposed STAformer model which predicts object bounding boxes, the associated verb/noun probabilities, time-to-contact estimates and confidence scores. (c) Environment affordances are inferred from video and used to refine the predicted noun/verb probabilities. (d) Our model observes detected hand-object interactions in the video and predicts an interaction hotspot probability map, which is used to re-weigh confidence scores based on box locations, leading to (e) our final predictions.
  • Figure 2: STAformer architecture. DINO-v2 and TimeSformer extract 2D and 3D features from the image-video input. (a) Frame-guided temporal pooling attention spatially aligns video to image features. (b) Dual image-video attention enriches 2D features with temporal dynamics and 3D features with fine-grained image details. Image and video representations are joined to obtain a global class token (c) and a feature pyramid (d), from which we obtain the STA predictions (e).
  • Figure 3: Cross-environment inference of affordances: The input video $\mathcal{V}'$ is matched to the affordance database by comparing its visual representation $\Psi^\mathcal{V}(\mathcal{V}')$ to the visual $Z^\mathcal{V}$ ($\circ$) and text $Z^\mathcal{T}$ ($\square$) zone descriptors. The affordance noun probability $p_{\text{aff}}\left(n|\mathcal{V}'\right)$ is obtained by weighting the counts of nouns present in the top-2K nearest zones ($\star$) according to the respective similarity $\mathcal{S}$. Example for K=2.
  • Figure 4: Refinement of confidence scores based on the interaction hotspots. The interaction hotspot model observes frames, hands, and objects and forecasts a map encoding the probability of the interaction in each pixel. STA confidence scores are re-weighted based on the probability values at the bounding box coordinate centers, reducing confidence in false positive predictions falling far from the interaction hotspot.
  • Figure 5: Predicted environment affordances: Linking across functionally similar environments ($\mathcal{K}^\mathcal{V}$, $\mathcal{K}^\mathcal{T}$) creates a robust affordance representation which captures the STA interaction. We show in orange the STA ground-truth label.
  • ...and 1 more figure
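
The two refinement steps described in the captions of Figures 3 and 4 can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the captions do not give exact formulas, so the sketch assumes that noun counts are weighted linearly by zone similarity and then normalized, and that each confidence score is simply multiplied by the hotspot probability at the box center. The function names, the data layout (a list of `(similarity, noun_counts)` pairs, a row-major hotspot map, boxes as `(x1, y1, x2, y2)` in pixel coordinates), and the uniform fallback are all hypothetical.

```python
def affordance_noun_probs(nearest_zones, num_nouns):
    """Similarity-weighted noun counts over the retrieved zones (Figure 3).

    nearest_zones: list of (similarity, noun_counts) pairs, where noun_counts
        maps a noun index to how often that noun was seen in the zone.
    num_nouns: size of the noun vocabulary.
    """
    scores = [0.0] * num_nouns
    for similarity, noun_counts in nearest_zones:
        for noun, count in noun_counts.items():
            # Assumption: counts are scaled linearly by the zone similarity.
            scores[noun] += similarity * count

    total = sum(scores)
    if total <= 0.0:
        # Assumption: with no evidence, fall back to a uniform distribution.
        return [1.0 / num_nouns] * num_nouns
    return [s / total for s in scores]


def reweight_confidences(boxes, scores, hotspot_map):
    """Scale each box score by the hotspot probability at its center (Figure 4).

    boxes: list of (x1, y1, x2, y2) in pixel coordinates of hotspot_map.
    scores: STA confidence score for each box.
    hotspot_map: 2D list indexed as hotspot_map[row][col] with values in [0, 1].
    """
    height, width = len(hotspot_map), len(hotspot_map[0])
    reweighted = []
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        # Box center, clamped so it always falls inside the map.
        cx = min(max(int((x1 + x2) / 2), 0), width - 1)
        cy = min(max(int((y1 + y2) / 2), 0), height - 1)
        # Assumption: plain multiplication by the hotspot probability.
        reweighted.append(score * hotspot_map[cy][cx])
    return reweighted


# Toy example: two retrieved zones and a 2x2 hotspot map.
zones = [(0.8, {0: 2, 1: 1}), (0.2, {1: 3})]
print(affordance_noun_probs(zones, num_nouns=3))

hotspots = [[0.1, 0.1], [0.1, 0.9]]
print(reweight_confidences([(0, 0, 1, 1), (1, 1, 2, 2)], [0.8, 0.5], hotspots))
```

In the toy example the first box starts with the higher score but lies in a low-probability region, so after re-weighting it ranks below the box that sits on the hotspot. This is the behavior the Figure 4 caption describes for false positives far from the interaction hotspot.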