Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Zaira Manigrasso; Matteo Dunnhofer; Antonino Furnari; Moritz Nottebaum; Antonio Finocchiaro; Davide Marana; Rosario Forte; Giovanni Maria Farinella; Christian Micheloni

TL;DR

This work introduces Online Visual Query 2D (OVQ2D), a streaming variant of episodic memory retrieval for egocentric vision in which each frame is observed only once, and presents ESOM, a three-component framework that discovers, tracks, and memorizes objects online to support efficient visual query localization. ESOM’s object memory (M_ego) stores compact spatio-temporal object representations; it is populated by an Object Memory Population pipeline and queried by a Query Retrieval and Localization stage, which returns the most recent matching track. Experiments on Ego4D show that ESOM outperforms other online approaches, yet reaches only about 4% success, and that current object detectors and trackers are the main bottlenecks: with oracle discovery and tracking, success rises to 81.92%. The work provides an OVQ2D benchmark, analyzes memory-efficiency trade-offs, and lays a foundation for deploying episodic memory on real-world wearable devices, while highlighting the need for better perception modules.

Abstract

Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an "offline" setting with full video access at query time, limiting their applicability in real-world scenarios with power- and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of the full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that respectively find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM's superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM's accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need for applied research on these components.
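
As a rough illustration of the streaming constraint and of retrieving the most recent matching track, the idea can be sketched as below. This is a minimal sketch under stated assumptions, not ESOM's implementation: the names `Track`, `ObjectMemory`, `observe`, and `query`, the cosine-similarity matching, and the `match_threshold` and `max_gap` parameters are all hypothetical, and detections are assumed to arrive with precomputed embeddings from some upstream discovery module.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Track:
    """Hypothetical memory entry: one appearance embedding plus per-frame boxes."""
    embedding: list
    boxes: dict = field(default_factory=dict)  # frame index -> (x1, y1, x2, y2)

    @property
    def last_frame(self):
        return max(self.boxes)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ObjectMemory:
    """Toy stand-in for a compact object memory; frames themselves are never stored."""

    def __init__(self, match_threshold=0.8, max_gap=1):
        self.match_threshold = match_threshold  # assumed similarity cut-off
        self.max_gap = max_gap                  # assumed max frame gap to extend a track
        self.tracks = []

    def observe(self, frame_idx, detections):
        """Consume one frame's detections, given as (embedding, box) pairs, exactly once."""
        for emb, box in detections:
            # Only tracks seen in a recent earlier frame may be extended.
            live = [t for t in self.tracks
                    if 0 < frame_idx - t.last_frame <= self.max_gap]
            best = max(live, key=lambda t: cosine(t.embedding, emb), default=None)
            if best is not None and cosine(best.embedding, emb) >= self.match_threshold:
                best.boxes[frame_idx] = box
            else:
                self.tracks.append(Track(list(emb), {frame_idx: box}))

    def query(self, query_embedding):
        """Return the matching track with the latest last frame, or None."""
        matches = [t for t in self.tracks
                   if cosine(t.embedding, query_embedding) >= self.match_threshold]
        return max(matches, key=lambda t: t.last_frame, default=None)
```

In this sketch an object that leaves view and reappears later starts a new track, so a query can match several tracks of the same object and the latest one is returned. Memory grows with the number of tracks and boxes rather than with the number of stored frames.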
