Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finocchiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, Christian Micheloni

TL;DR

This work introduces Online Visual Query 2D (OVQ2D), a streaming, online variant of episodic memory retrieval for egocentric vision, and presents ESOM, a three-component framework that discovers, tracks, and memorizes objects online to support efficient visual query localization. ESOM's object memory ($\mathcal{M}_{ego}$) stores compact spatio-temporal representations; it is populated by an Object Memory Population pipeline and queried by a Query Retrieval and Localization algorithm that retrieves the most recent matching track. Experiments on Ego4D show that ESOM outperforms other online approaches and that current object detectors and trackers remain the main bottlenecks: success is only ~4% with real components but reaches 81.92% when both tracking and discovery are perfect. The work provides an OVQ2D benchmark, analyzes memory-efficiency trade-offs, and lays a foundation for deploying episodic memory on real-world wearable devices, while highlighting the need for advances in perception modules.

Abstract

Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an "offline" setting with full video access at query time, limiting their applicability in real-world scenarios with power- and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of the full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM's superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM's accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need for applied research on these components.

Paper Structure

This paper contains 59 sections, 6 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Online visual query localization via object memorization and retrieval. We tackle the problem of online episodic memory and propose ESOM, an architecture that processes an egocentric video ($\mathcal{V}_{ego}$) online and only once to detect, track, and memorize ($\mathcal{M}_{ego}$) user-relevant objects and frames. When retrieving a visual query ($\mathbf{Q}$), ESOM searches its memory ($\mathcal{M}_{ego}$) for the most recent instance where the query was spatio-temporally localized. By avoiding video storage, ESOM optimizes both memory usage and retrieval speed.
  • Figure 2: From an egocentric video to a memory of objects. ESOM injects the visual information in each frame $\mathbf{F}_t$ of a video $\mathcal{V}_{ego}$ into $\mathcal{M}_{ego}$, an object memory represented as a dynamic list $\mathbf{O}_i$ of tuples $\mathbf{o}_{i,t}$ composed of instance-based, frame-level bounding boxes $\mathbf{b}_{i,t}$, related frames $\mathbf{F}_t$, and relevance labels $c_{i,t}$. $\mathcal{M}_{ego}$ is built by an Object Memory Population (OMP) algorithm which processes frames $\mathbf{F}_t$ online.
  • Figure 3: Memory population by object tracking and discovery. The Object Tracking (OT) module reads objects $\mathbf{o}_{i,t-1}$ related to the last frame $\mathbf{F}_{t-1}$ from $\mathcal{M}_{ego}$ and updates their position in the current frame $\mathbf{F}_t$. In parallel, new objects are detected by the Object Discovery (OD) module in frame $\mathbf{F}_t$. Relevance scores $c_{i,t}$ are computed and $\mathcal{M}_{ego}$ is updated.
  • Figure 4: Visual query localization by memory retrieval. When the user provides a visual query $\mathbf{Q}$, the Query Retrieval and Localization algorithm is triggered. This algorithm compares the representation of $\mathbf{Q}$ with the representations of each object $\mathbf{O}_i$ in $\mathcal{M}_{ego}$. The sequence of contiguous bounding boxes corresponding to the best-matched object and the associated RGB frames are retained as the visual response track $\mathbf{r}$ for $\mathbf{Q}$, while all other objects are discarded.
  • Figure 5: ESOM scales well while increasing memory size. Plots show how success score, storage space, and retrieval time change when building memory from progressively processed video segments. Results are shown for three different OMP configurations.
  • ...and 5 more figures
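
The object memory and retrieval step described in the captions of Figures 2 and 4 can be illustrated with a small Python sketch. This is not the authors' implementation: the class names, the `(x, y, w, h)` box format, the per-object `embedding` vector, and the use of cosine similarity for matching are all assumptions made for illustration, and frame indices stand in for the stored RGB frames. The sketch also assumes observations arrive in increasing frame order, as they would in a stream.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x, y, w, h) format


@dataclass
class Observation:
    """One tuple o_{i,t}: frame index t, box b_{i,t}, relevance label c_{i,t}."""
    t: int
    box: Box
    relevance: float


@dataclass
class ObjectTrack:
    """One object O_i: a growing list of observations plus an assumed embedding."""
    embedding: List[float]  # hypothetical appearance descriptor
    observations: List[Observation] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def last_contiguous_run(obs: List[Observation]) -> List[Observation]:
    """Most recent run of observations whose frame indices are consecutive."""
    if not obs:
        return []
    run = [obs[-1]]
    for o in reversed(obs[:-1]):
        if run[-1].t - o.t != 1:  # a gap in time ends the run
            break
        run.append(o)
    run.reverse()
    return run


class ObjectMemory:
    """Toy stand-in for M_ego, keyed by object id i."""

    def __init__(self) -> None:
        self.objects: Dict[int, ObjectTrack] = {}

    def update(self, obj_id: int, t: int, box: Box, relevance: float,
               embedding: Sequence[float]) -> None:
        """Append the observation for frame t to object obj_id."""
        if obj_id not in self.objects:
            self.objects[obj_id] = ObjectTrack(embedding=list(embedding))
        self.objects[obj_id].observations.append(Observation(t, box, relevance))

    def retrieve(self, query: Sequence[float]) -> Optional[List[Observation]]:
        """Return the latest contiguous track of the best-matched object."""
        if not self.objects:
            return None
        best = max(self.objects.values(),
                   key=lambda o: cosine(o.embedding, query))
        return last_contiguous_run(best.observations)


# Usage: object 0 is seen in frames 0-2 and again in 7-8; object 1 in 3-4.
memory = ObjectMemory()
for t in (0, 1, 2, 7, 8):
    memory.update(0, t, (10.0, 10.0, 5.0, 5.0), 1.0, [1.0, 0.0])
for t in (3, 4):
    memory.update(1, t, (40.0, 20.0, 8.0, 8.0), 1.0, [0.0, 1.0])

response = memory.retrieve([0.9, 0.1])
print([o.t for o in response])  # prints [7, 8]
```

Only boxes, frame references, and labels are kept per object, so a query compares against one representation per stored object rather than scanning the whole video, which is the property the captions attribute to $\mathcal{M}_{ego}$.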