StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation

Francesco Ragusa; Giovanni Maria Farinella; Antonino Furnari

TL;DR

This paper studies short-term object interaction anticipation from the egocentric point of view and proposes StillFast, a new end-to-end architecture. StillFast ranked first on the public leaderboard of the EGO4D short-term object interaction anticipation challenge 2022 and is the official baseline for the 2023 edition.

Abstract

The anticipation problem has been studied from different angles, such as predicting humans' locations, predicting hand and object trajectories, and forecasting actions and human-object interactions. In this paper, we study the short-term object interaction anticipation problem from the egocentric point of view and propose a new end-to-end architecture named StillFast. Our approach simultaneously processes a still image and a video to detect and localize next-active objects, predict the verb that describes the future interaction, and determine when the interaction will start. Experiments on the large-scale egocentric dataset EGO4D show that our method outperforms state-of-the-art approaches on the considered task. Our method ranked first on the public leaderboard of the EGO4D short-term object interaction anticipation challenge 2022. Please see the project web page for code and additional details: https://iplab.dmi.unict.it/stillfast/.

Paper Structure (16 sections, 4 figures, 4 tables)

Introduction
Related Work
Anticipation in Third Person Vision
Anticipation in First Person Vision
Short-Term Object Interaction Anticipation
Still Fast Network
StillFast Backbone
Prediction Head
Experimental Settings
Dataset and Evaluation Measures
Compared Methods
Results
Comparison with the State of the Art
Ablation study
Conclusion
...and 1 more sections

Figures (4)

Figure 1: Short-term object interaction anticipation task. Models can process a video $V$ up to time $t$ (denoted as $V_{:t}$) and predict the bounding box and class of each next-active object, the verb that describes the future interaction, a real number indicating when the interaction will happen ($t+\delta$), and a score. $\delta$ represents the time interval between the last observable frame $V_t$ and the frame of contact at time $t+\delta$.
Figure 2: StillFast is composed of a two-branch backbone. Given an input video $V$ and a timestamp $t$, the proposed model takes as input a high-resolution frame $V_t$ (top) and a low-resolution video $V_{(t-\tau_o):t}$ (bottom). A 2D backbone ("still" branch) processes the high-resolution frame $V_t$, producing a stack of 2D features $\Phi_{2D}(V_t)$. A 3D backbone ("fast" branch) processes the low-resolution video $V_{(t-\tau_o):t}$, obtaining a stack of 3D features $\Phi_{3D}(V_{(t-\tau_o):t})$. The Combined Feature Pyramid Layer is responsible for: 1) up-sampling the stack of 3D features with nearest-neighbor interpolation to match the spatial resolution of the 2D features and averaging over the temporal dimension, obtaining the features $\Phi^{2D}_{3D}(V_{(t-\tau_o):t})$, which have the same dimensions as the 2D features $\Phi_{2D}(V_t)$; 2) fusing these two stacks of features to obtain the final combined feature pyramid $P_t$. 3x3 convolutional layers are added before and after the sum operation to remove artifacts introduced by the up-sampling and sum operations.
Figure 3: The StillFast Prediction Head is based on the Faster R-CNN prediction head. From the Combined Feature Pyramid $P_t$ we obtain global and local features. Local features are obtained through a Region Proposal Network (RPN), which predicts region proposals from which we compute local features through a RoI Align layer. Global features are obtained with a Global Average Pooling operation and are concatenated with the local features. The concatenated features are fed into a fusion network, and its output is summed with the original local features through residual connections. These local-global representations are finally used to predict object (noun) and verb probability distributions and the time-to-contact (ttc) through linear layers, along with the related prediction score $s$.
Figure 4: Two success examples (left) and two failure cases (right).
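
The fusion step described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: it handles a single channel at a single pyramid level using plain Python lists, it omits the learned 3x3 convolutions applied before and after the sum, and the names `temporal_average`, `upsample_nearest`, and `fuse_level` are hypothetical. It only shows how a low-resolution 3D feature map can be resized with nearest-neighbor interpolation, averaged over time, and summed with a high-resolution 2D feature map.

```python
from typing import List

Frame = List[List[float]]  # one channel of a feature map, H x W
Clip = List[Frame]         # T frames of the same channel


def temporal_average(clip: Clip) -> Frame:
    """Average a T x h x w clip over its temporal dimension."""
    t = len(clip)
    h, w = len(clip[0]), len(clip[0][0])
    return [[sum(frame[i][j] for frame in clip) / t for j in range(w)]
            for i in range(h)]


def upsample_nearest(feat: Frame, out_h: int, out_w: int) -> Frame:
    """Nearest-neighbor resize of an h x w map to out_h x out_w."""
    h, w = len(feat), len(feat[0])
    return [[feat[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def fuse_level(feat_2d: Frame, feat_3d: Clip) -> Frame:
    """Fuse one pyramid level (illustrative only, no learned layers).

    The 3D features are resized to the 2D resolution, averaged over
    time, and added element-wise to the 2D features.
    """
    out_h, out_w = len(feat_2d), len(feat_2d[0])
    resized = [upsample_nearest(frame, out_h, out_w) for frame in feat_3d]
    pooled = temporal_average(resized)
    return [[feat_2d[i][j] + pooled[i][j] for j in range(out_w)]
            for i in range(out_h)]


# Toy inputs: a 4x4 "still" feature map and a 2-frame 2x2 "fast" clip.
still = [[1.0] * 4 for _ in range(4)]
fast = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[2.0, 4.0], [6.0, 8.0]],
]
fused = fuse_level(still, fast)
```

In this toy case the fused map has the 4x4 resolution of the still branch, and each 2x2 block of it carries the time-averaged value of one low-resolution cell plus the still feature. A real model would operate on batched multi-channel tensors and apply the convolutions mentioned in the caption.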
