AMEGO: Active Memory from long EGOcentric videos

Gabriele Goletto; Tushar Nagarajan; Giuseppe Averta; Dima Damen

TL;DR

AMEGO tackles the challenge of understanding very-long egocentric videos by building an online, semantic-free memory that encodes hand–object interactions as HOI tracklets and location segments as activity-centric hotspots, forming $\mathcal{E} = \{\mathcal{O}, \mathcal{L}\}$. This memory supports querying without reprocessing entire footage, enabling efficient, multi-faceted QA about when objects were used, where activities occurred, and how interactions unfolded. To evaluate this approach, the authors introduce the Active Memories Benchmark (AMB), a 20.5k-question, vision-first benchmark covering sequencing, concurrency, and temporal grounding in long EPIC-KITCHENS videos. Experiments show AMEGO achieves state-of-the-art performance on AMB, markedly surpassing baselines and demonstrating robustness, interpretability, and potential for scalable analysis of procedural egocentric activities, with stronger performance when object–location interplay is leveraged.

Abstract

Egocentric videos provide a unique perspective into individuals' daily experiences, yet their unstructured nature presents challenges for perception. In this paper, we introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos. Inspired by humans' ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from one egocentric video, capturing key locations and object interactions. This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content. Additionally, to evaluate our understanding of very-long egocentric videos, we introduce the new Active Memories Benchmark (AMB), composed of more than 20K highly challenging visual queries from EPIC-KITCHENS. These queries cover different levels of video reasoning (sequencing, concurrency and temporal grounding) to assess detailed video understanding capabilities. We showcase improved performance of AMEGO on AMB, surpassing other video QA baselines by a substantial margin.
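To make the idea of a queryable, semantic-free representation concrete, the following is a minimal Python sketch of a memory holding object interactions and location segments as frame intervals. It is an illustration under stated assumptions, not the authors' implementation: the class names, field names (`object_id`, `side`, `start`, `end`) and both query helpers are hypothetical, and the interval-overlap rule is a simplification chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class HOITracklet:
    """One hand-object interaction over a span of frames (hypothetical layout)."""
    object_id: int  # instance id, not a semantic class: the memory is semantic-free
    side: str       # "left" or "right" hand (assumed field)
    start: int      # first frame of the interaction
    end: int        # last frame of the interaction, inclusive


@dataclass
class LocationSegment:
    """One visit to a location, stored as a frame interval."""
    location_id: int
    start: int
    end: int


@dataclass
class Memory:
    """Object interactions plus location segments, built once per video."""
    interactions: list = field(default_factory=list)
    locations: list = field(default_factory=list)

    def when_used(self, object_id):
        """Frame intervals in which the given object instance was handled."""
        return sorted(
            (t.start, t.end) for t in self.interactions if t.object_id == object_id
        )

    def objects_used_at(self, location_id):
        """Object instances whose interactions overlap any visit to the location."""
        visits = [s for s in self.locations if s.location_id == location_id]
        return {
            t.object_id
            for t in self.interactions
            for s in visits
            if t.start <= s.end and s.start <= t.end  # closed-interval overlap
        }


# Toy memory: two object instances, two locations (made-up frame numbers).
memory = Memory(
    interactions=[
        HOITracklet(object_id=0, side="right", start=10, end=50),
        HOITracklet(object_id=1, side="left", start=70, end=90),
        HOITracklet(object_id=0, side="right", start=120, end=150),
    ],
    locations=[
        LocationSegment(location_id=0, start=0, end=60),
        LocationSegment(location_id=1, start=61, end=200),
    ],
)

print(memory.when_used(0))        # intervals for object instance 0
print(memory.objects_used_at(0))  # instances handled during visits to location 0
```

Both queries read only the stored intervals, so answering them never touches the video frames again, which is the property the abstract describes.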

Paper Structure (24 sections, 2 equations, 10 figures, 6 tables, 2 algorithms)

Introduction
Related Works
Method - AMEGO
Object Interactions
Location segments
Querying AMEGO representations
Active Memories Benchmark
Query Criteria
Benchmark Construction
Experiments
Experimental setup
Implementation details
Standalone performance
Results on Active Memories Benchmark
Conclusion
...and 9 more sections

Figures (10)

Figure 1: AMEGO captures key locations and object interactions in a structured representation. In each frame on top, the external border colour refers to a specific location in AMEGO, while the colours of objects define specific instances. AMEGO unlocks fine-grained long video understanding, allowing multiple queries, such as the one depicted at the bottom of the figure, without reprocessing the long input video.
Figure 2: We build $\mathcal{O}$ in an online manner, performing the three steps depicted at each frame of the video. (i) Initialisation: we use consistent active object detections to generate new HOI tracklets, thereby discarding noise that results in sparse detections. (ii) Updating: once a new tracklet is initialised, we use a single-object tracker (SOT, $\mathcal{T}_{o_i}$) to update its detections even when hands go out of the field of view. We end the tracklet when there are $e_o$ consecutive frames with a free hand or a distinctive new object interaction. (iii) Assignment: once a tracklet terminates, we assign it an object instance based on the similarity between its visual features and those in memory $\mathcal{O}$.
Figure 3: An example of AMEGO on a long egocentric video, showing the objects the subject interacts with using the left and right hands, and the visited locations.
Figure 4: Example queries from the Active Memories Benchmark on an egocentric video (in the middle). We build our benchmark around three levels of reasoning: sequencing, concurrency and temporal grounding.
Figure 5: Quantitative results depending on the temporal duration of the queried video.
...and 5 more figures
