Anticipating Next Active Objects for Egocentric Videos
Sanket Thakur, Cigdem Beyan, Pietro Morerio, Vittorio Murino, Alessio Del Bue
TL;DR
This work targets Anticipating the Next ACTive Object (ANACTO) in egocentric video, aiming to locate the object a person will contact within a future time-to-contact (TTC) window, before any action occurs. It introduces T-ANACTO, a transformer-based encoder–decoder that fuses Faster R-CNN object detections with a Vision Transformer to model hand–object interactions and field-of-view (FoV) drift, using an autoregressive decoder to predict the next-active-object location across future frames. Training combines three losses, $\mathcal{L}_{feat}$, $\mathcal{L}_{cao}$, and $\mathcal{L}_{nao}$, as $\mathcal{L} = \mathcal{L}_{feat} + \lambda_1 \mathcal{L}_{cao} + \lambda_2 \mathcal{L}_{nao}$. The model is evaluated on EK-100, EGTEA+, and Ego4D with new ANACTO annotations. Results show that incorporating object-centric cues and transformer-based temporal attention yields consistent improvements over strong baselines (AVT, RULSTM, TSN) across different TTC settings, and qualitative attention visualizations corroborate the model's focus on prospective interaction regions and next-active-object (NAO) locations. The work advances proactive understanding of future first-person interactions and provides ANACTO annotations to support future research in egocentric video analysis.
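As a rough illustration of the training objective stated above, the weighted sum can be sketched in a few lines of Python. The function name `combined_loss`, the default weights, and the numeric inputs are placeholders chosen for this sketch; the summary does not state the values of $\lambda_1$ and $\lambda_2$ or how the individual terms are computed.

```python
def combined_loss(l_feat, l_cao, l_nao, lambda_1=1.0, lambda_2=1.0):
    """Weighted sum L = L_feat + lambda_1 * L_cao + lambda_2 * L_nao.

    The three inputs are assumed to be already-computed scalar loss values.
    The default weights are placeholders, not values reported by the authors.
    """
    return l_feat + lambda_1 * l_cao + lambda_2 * l_nao


# Hypothetical loss values, used only to show how the terms are weighted.
total = combined_loss(0.5, 0.25, 0.125, lambda_1=2.0, lambda_2=4.0)
print(total)  # 0.5 + 2.0 * 0.25 + 4.0 * 0.125 = 1.5
```

Setting $\lambda_1$ or $\lambda_2$ to zero in such a sketch removes the corresponding term, which is one simple way to think about the contribution of each loss.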
Abstract
This paper addresses the problem of anticipating, for a given egocentric video clip, the future location of the next-active-object, i.e., where contact might happen, before any action takes place. The problem is hard because the position of such objects must be estimated in a scenario where the observed clip and the action segment are separated by the so-called "time to contact" (TTC) segment. Many methods have been proposed to anticipate the action of a person based on previous hand movements and interactions with the surroundings. However, there have been no attempts to investigate the next possible interactable object and its future location with respect to the first person's motion and the field-of-view drift during the TTC window. We define this as the task of Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip. We benchmark our method on three datasets: EpicKitchens-100, EGTEA+ and Ego4D. We also provide annotations for the first two datasets. Our approach outperforms the relevant baseline methods. We also conduct ablation studies to understand the effectiveness of the proposed and baseline methods under varying conditions. Code and ANACTO task annotations will be made available upon paper acceptance.
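The temporal setup described in the abstract, where the observed clip ends one TTC interval before the action segment begins, can be sketched as follows. This is a minimal illustration only: the function name `observed_clip_bounds`, its parameters, and the example durations are assumptions made for this sketch and are not taken from the paper.

```python
def observed_clip_bounds(action_start_s, ttc_s, clip_len_s):
    """Return (start, end) in seconds of the observed clip.

    The observed clip is assumed to end exactly `ttc_s` seconds before the
    action segment starts and to span at most `clip_len_s` seconds, clipped
    at the beginning of the video (t = 0).
    """
    if ttc_s < 0 or clip_len_s <= 0:
        raise ValueError("ttc_s must be >= 0 and clip_len_s must be > 0")
    obs_end = action_start_s - ttc_s
    obs_start = max(0.0, obs_end - clip_len_s)
    if obs_end <= obs_start:
        raise ValueError("no observable frames before the TTC segment")
    return obs_start, obs_end


# Hypothetical example: action starts at 10 s, TTC is 1 s, clip length is 2 s.
print(observed_clip_bounds(10.0, 1.0, 2.0))  # (7.0, 9.0)
```

In this sketch the model would see only the frames in the returned interval; the TTC segment between the end of the clip and the action start remains unobserved.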
