Integrating Affordances and Attention models for Short-Term Object Interaction Anticipation

Lorenzo Mur Labadia; Ruben Martinez-Cantin; Jose J. Guerrero; Giovanni M. Farinella; Antonino Furnari

TL;DR

This work proposes STAformer and STAformer++, two novel attention-based architectures for short-term object-interaction anticipation (STA) that integrate frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to predict from an image-video input pair. It also introduces two novel modules that ground STA predictions in human behavior by modeling affordances.

Abstract

Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants to understand user goals and provide timely assistance, or to enable human-robot interaction. In this work, we present a method to improve the performance of STA predictions. Our contributions are two-fold: (1) We propose STAformer and STAformer++, two novel attention-based architectures integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-video input pair; (2) We introduce two novel modules to ground STA predictions in human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of the interactions that can take place in a given physical scene. We explore how to integrate environment affordances via simple late fusion and via an approach that adaptively learns how best to fuse affordances with end-to-end predictions. Second, we predict interaction hotspots from the observation of hand and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant improvements in Overall Top-5 mAP, with gains of up to +23 p.p. on Ego4D and +31 p.p. on a novel set of curated EPIC-Kitchens STA labels. We released the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.
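The "simple late fusion" of environment affordances mentioned in the abstract can be illustrated with a minimal sketch. It follows the Figure 4(b) caption only loosely: the K nearest zones are picked by cosine similarity for both the video and the text descriptors (2K zones in total), and noun counts are weighted by similarity. The dictionary layout, the function names, the clipping of negative similarities, and the convex-combination rule with `alpha` are all assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def affordance_distribution(query, zones, k=2):
    """Similarity-weighted noun distribution from the nearest zones.

    query: encoded input video (list of floats).
    zones: list of dicts with 'video' and 'text' descriptors and 'noun_counts'
           (hypothetical layout; descriptors are assumed to share one
           embedding space so the query can be compared with both).
    """
    weighted = defaultdict(float)
    for key in ("video", "text"):  # K neighbors per descriptor type -> 2K zones
        scored = sorted(
            ((cosine(query, z[key]), z) for z in zones),
            key=lambda pair: pair[0],
            reverse=True,
        )
        for sim, zone in scored[:k]:
            for noun, count in zone["noun_counts"].items():
                # Assumption: negative similarities contribute nothing.
                weighted[noun] += max(sim, 0.0) * count
    total = sum(weighted.values())
    return {n: w / total for n, w in weighted.items()} if total > 0 else {}


def late_fuse(p_model, p_aff, alpha=0.5):
    """Assumed fusion rule: convex combination of model and affordance scores."""
    nouns = set(p_model) | set(p_aff)
    return {
        n: (1 - alpha) * p_model.get(n, 0.0) + alpha * p_aff.get(n, 0.0)
        for n in nouns
    }


zones = [
    {"video": [1.0, 0.0], "text": [1.0, 0.0], "noun_counts": {"cup": 3}},
    {"video": [0.0, 1.0], "text": [0.0, 1.0], "noun_counts": {"knife": 2}},
]
p_aff = affordance_distribution([1.0, 0.0], zones, k=1)
fused = late_fuse({"cup": 0.2, "knife": 0.8}, p_aff, alpha=0.5)
```

In this toy example the query matches only the first zone, so the affordance prior puts all its mass on "cup" and the fused score for "cup" rises above the one for "knife". The same pattern would apply to verbs.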

Paper Structure (33 sections, 10 equations, 10 figures, 10 tables)

This paper contains 33 sections, 10 equations, 10 figures, 10 tables.

Introduction
Related works
Short-term Object Interaction Anticipation
Affordances for Anticipation
Object Detection Architectures
STAformer, a Transformer-based Architecture for Short-Term Anticipation
Problem Formulation
Feature Extraction
Frame-guided Temporal Pooling Attention (Figure 2(a))
Dual Image-Video Attention fusion (Figure 2(b))
Feature Fusion and Fast-RCNN based STA prediction head (Figure 2(c)-(e))
STAformer++: End-to-End Short-Term Anticipation with Transformers
Feature extraction
Per-Scale Frame-guided Temporal Pooling Attention and Feature Fusion (Figure 3(a)-(b))
...and 18 more sections

Figures (10)

Figure 1: (a) Our approach takes as input an image-video pair. (b) The input is processed by our novel STAformer++, an end-to-end short-term anticipation model based on transformers which predicts object bounding boxes, the associated verb/noun probabilities, time-to-contact estimates, and confidence scores. (c) The model learns to predict environment noun and verb affordances ($p_{\text{aff}}(n \lvert \mathcal{V}')$ and $p_{\text{aff}}(v \lvert \mathcal{V}')$) in a dynamic and flexible way during training. These representations are later used to refine the predicted nouns and verbs, yielding the final predictions (e).
Figure 2: STAformer architecture. DINO-v2 and TimeSformer extract 2D and 3D features from the image-video input. (a) Frame-guided temporal pooling attention spatially aligns video to image features. (b) Dual image-video attention enriches 2D features with temporal dynamics and 3D features with fine-grained image details. Image and video representations are joined to obtain a global class token (c) and a feature pyramid (d), from which we obtain the STA predictions (e).
Figure 3: STAformer++ architecture. The Swin-T backbone extracts hierarchical multi-scale 2D feature maps from the high-resolution image, while the EgoVideo backbone extracts spatio-temporal 3D features. (a) We compute per-scale frame-guided temporal pooling, and then resize the pooled video tokens to the respective image map. (b) The two feature maps are summed to obtain the fused feature pyramid $P_T$. (c) The DETR Encoder enhances the features and applies Mixed Query Selection to initialize the positional part of the object queries $\rho_m$, while the content parts are kept as learnable parameters. (d) The DETR Decoder incorporates the refined image-video features into the object queries. We accelerate convergence using a Contrastive DeNoising (CDN) part with positive and negative samples as proposed in li2022dn. (e) The STA prediction head applies independent MLP layers to obtain the final predictions $(\hat{b}_m, \hat{n}_m, \hat{v}_m, \hat{\delta}_m, \hat{s}_m)$.
Figure 4: Environment affordances in forecasting. (a) We build an affordance database by linking training videos according to their visual similarity, obtaining activity-centric zones with affordance values $V_j^{AFF}$ and the respective video $\mathcal{Z}_j^\mathcal{V}$ and text $\mathcal{Z}_j^\mathcal{T}$ descriptors. (b) Our first approach matches the encoded input video $\Phi^\mathcal{V}(\mathcal{V}')$ to the affordance database by selecting the K nearest neighbors in terms of cosine similarity with the visual $\mathcal{Z}^\mathcal{V}$ and text $\mathcal{Z}^\mathcal{T}$ zone descriptors. The affordance probability $p_{AFF}$ is obtained by weighting the counts of nouns present in the top-2K nearest zones ($\star$) according to their respective similarity $\mathcal{S}$. This probability is then late-fused with the predictions made by the end-to-end model. The example uses K=2. (c) In our second approach, an attention mechanism ($Q^{AFF}, K^{AFF}$) learns to associate a novel video $\mathcal{V}'$ with all the potential zone candidates $Z_j$ in the affordance database. This dynamically yields the noun $\mathcal{N}_{AFF}$ and verb $\mathcal{A}_{AFF}$ affordance distributions, which are added to the DETR-predicted noun $n_m$ and verb $v_m$ logits during model training. The final binary class probabilities $p(n)_m$, $p(v)_m$ are obtained after a Sigmoid layer.
Figure 5: Refinement of confidence scores based on the interaction hotspots. The interaction hotspot model observes frames, hands, and objects and forecasts a map encoding the probability of the interaction in each pixel. STA confidence scores are re-weighted based on the probability values at the bounding box coordinate centers, reducing confidence in false positive predictions falling far from the interaction hotspot.
...and 5 more figures
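The confidence refinement described in the Figure 5 caption can be sketched as follows: each STA confidence score is re-weighted by the hotspot probability sampled at the center of its bounding box. The function name, the nearest-pixel lookup, and the `floor` parameter (which keeps a small fraction of the score for boxes far from the hotspot) are assumptions for illustration; the caption does not specify the exact re-weighting rule.

```python
def reweight_by_hotspot(boxes, scores, hotspot, floor=0.1):
    """Scale each confidence score by the hotspot probability at the box center.

    boxes:   list of (x1, y1, x2, y2) in pixel coordinates.
    scores:  list of confidence scores, one per box.
    hotspot: 2D list [height][width] of interaction probabilities in [0, 1].
    floor:   assumed minimum multiplier so that no score collapses to zero.
    """
    height, width = len(hotspot), len(hotspot[0])
    refined = []
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        # Box center, clamped to the valid pixel range of the map.
        cx = min(max(int((x1 + x2) / 2), 0), width - 1)
        cy = min(max(int((y1 + y2) / 2), 0), height - 1)
        prob = hotspot[cy][cx]
        refined.append(score * (floor + (1.0 - floor) * prob))
    return refined


# Toy 4x4 hotspot map with all probability mass at pixel (x=1, y=1).
hotspot = [[0.0] * 4 for _ in range(4)]
hotspot[1][1] = 1.0
boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]
refined = reweight_by_hotspot(boxes, [0.8, 0.8], hotspot)
```

Here the first box is centered on the hotspot and keeps its score, while the second box falls outside it and is down-weighted to the floor fraction, which mirrors the caption's goal of reducing confidence in false positives far from the interaction hotspot.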
