Anticipating Next Active Objects for Egocentric Videos

Sanket Thakur, Cigdem Beyan, Pietro Morerio, Vittorio Murino, Alessio Del Bue

TL;DR

This work addresses the task of Anticipating the Next ACTive Object (ANACTO) in egocentric video: localizing the object a person will make contact with after a time-to-contact (TTC) window, before any action occurs. It introduces T-ANACTO, a transformer-based encoder-decoder that fuses Faster R-CNN object detections with a Vision Transformer to model hand-object interactions and field-of-view (FoV) drift, and uses an autoregressive decoder to predict the next-active-object (NAO) location across future frames. Training combines three losses, $\mathcal{L}_{feat}$, $\mathcal{L}_{cao}$, and $\mathcal{L}_{nao}$, as $\mathcal{L} = \mathcal{L}_{feat} + \lambda_1 \mathcal{L}_{cao} + \lambda_2 \mathcal{L}_{nao}$. The model is evaluated on EK-100, EGTEA+, and Ego4D with new ANACTO annotations. Results show that incorporating object-centric cues and transformer-based temporal attention yields consistent improvements over strong baselines (AVT, RULSTM, TSN) across different TTC settings, and qualitative attention visualizations corroborate the model's focus on prospective interaction regions and NAO locations. The work advances proactive understanding of future first-person interactions and provides ANACTO annotations to support future research in egocentric video analysis.
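As a minimal sketch of how the three loss terms are weighted into the total objective, the snippet below evaluates $\mathcal{L} = \mathcal{L}_{feat} + \lambda_1 \mathcal{L}_{cao} + \lambda_2 \mathcal{L}_{nao}$ on plain numbers. The function name `combined_loss` and the default weights of 1.0 are illustrative assumptions; this summary does not state the values of $\lambda_1$ and $\lambda_2$ or how each term is computed.

```python
def combined_loss(l_feat: float, l_cao: float, l_nao: float,
                  lambda_1: float = 1.0, lambda_2: float = 1.0) -> float:
    """Weighted sum L = L_feat + lambda_1 * L_cao + lambda_2 * L_nao.

    The default weights are placeholders (assumption), not values from the paper.
    """
    return l_feat + lambda_1 * l_cao + lambda_2 * l_nao


# Example with made-up per-term losses and weights.
total = combined_loss(1.0, 2.0, 3.0, lambda_1=0.5, lambda_2=2.0)
print(total)
```

In a training loop the three inputs would be scalar tensors rather than floats; the weighting itself is unchanged.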

Abstract

This paper addresses the problem of anticipating, for a given egocentric video clip, the future location of the next-active-object, i.e., where contact might happen, before any action takes place. The problem is considerably hard, since the position of such objects must be estimated in a scenario where the observed clip and the action segment are separated by the so-called "time to contact" (TTC) segment. Many methods have been proposed to anticipate the action of a person based on previous hand movements and interactions with the surroundings. However, there have been no attempts to investigate the next possible interactable object and its future location with respect to the first-person's motion and the field-of-view drift during the TTC window. We define this as the task of Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip. We benchmark our method on three datasets: EpicKitchens-100, EGTEA+ and Ego4D. We also provide annotations for the first two datasets. Our approach performs best compared to relevant baseline methods. We also conduct ablation studies to understand the effectiveness of the proposed and baseline methods under varying conditions. Code and ANACTO task annotations will be made available upon paper acceptance.

Paper Structure (23 sections, 5 equations, 13 figures, 4 tables)


Figures (13)

  • Figure 1: The goal of our work is to anticipate the next-active-object, i.e., to localize the object that the person will interact with in the first frame of an action segment, based on the evidence of a video clip of length $\tau_o$ located $\tau_a$ seconds (the anticipation time) before the beginning of the action segment at time-step $t = \tau_s$.
  • Figure 2: Our T-ANACTO model is an encoder-decoder architecture. Its encoder is composed of an object detector (Faster R-CNN) and a Vision Transformer (ViT). The object detector takes an input frame (e.g., of size 1920$\times$1080) and predicts the locations of objects as bounding boxes ($x$, $y$, $w$, $h$) with detection confidence scores ($c$). The input to the ViT is the frame(s), first resized to 224$\times$224 and then divided into 16$\times$16 patches. The object detections ($x$, $y$, $w$, $h$) are also rescaled to match the resized frame (i.e., 224$\times$224), reshaped, and passed through an MLP that maps them to the same dimension as the embeddings from the transformer encoder; the two are then concatenated and given to the decoder. A linear layer between the T-ANACTO encoder and the decoder adjusts the feature dimensions fed to the transformer decoder. The transformer decoder uses temporal aggregation to predict the next active object: for each frame, it aggregates the encoder features for the current and past frames along with the embeddings of the last predicted active objects, and then predicts the next active object for the future frames.
  • Figure 3: The observed video segment of length $\tau_o$ is sampled at a frame interval equal to the TTC (shown as $\tau_a$), so that the same interval separates (1) consecutive sampled frames and (2) the last observed frame and the starting frame of the action segment, which starts at $t = \tau_s$.
  • Figure 4: The top row shows the "last observed frame", the middle row shows "the region of interest of T-ANACTO", and the bottom row shows "the starting frame of an action". The green box(es) in the bottom row mark the location of the NAO bounding box in the starting frame(s) of the action.
  • Figure 5: Attention maps generated by our T-ANACTO encoder for the last observed frame of video clips with TTC $\tau_a = 0.25$ seconds before the beginning of the action. The red regions depict the region of interest for identifying the next active object in the starting frame of the action. The green bounding box in the row showing the starting frame of the action marks the localization of the active object for that frame. Interestingly, for segments in which there is no active object at the start of the action, our encoder is still able to identify the likely area of interest for the frames that follow the starting frame of the action.
  • ...and 8 more figures