Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

Giuseppe Lando; Rosario Forte; Antonino Furnari

TL;DR

An edge-based pipeline achieves results competitive with cloud-based solutions, highlighting the potential of edge deployment for privacy-preserving episodic memory retrieval.

Abstract

We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. Cloud offloading is common, but it raises privacy and latency concerns for wearable assistants, so we investigate implementation on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries and show promising results, also when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
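The two-thread structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `TextualMemory`, `describe_clip`, `answer_query`, and `run_demo` are hypothetical names, the Video LLM Descriptor is replaced by a string formatter, and the Reasoner is replaced by a naive keyword match. Only the data flow follows the text: clips go in, descriptions accumulate in a textual memory, and queries read that memory.

```python
import queue
import threading


class TextualMemory:
    """Thread-safe, append-only store of clip descriptions (the textual memory)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []

    def append(self, description):
        with self._lock:
            self._entries.append(description)

    def snapshot(self):
        # Return a copy so the QA side never iterates a list being mutated.
        with self._lock:
            return list(self._entries)


def describe_clip(clip):
    # Hypothetical stand-in for the Video LLM Descriptor.
    return f"clip {clip['index']}: {clip['content']}"


def answer_query(question, memory_entries):
    # Hypothetical stand-in for the Reasoner: return the most recent
    # memory entry that shares at least one word with the question.
    words = {w.strip("?.,").lower() for w in question.split()}
    for entry in reversed(memory_entries):
        if words & set(entry.lower().split()):
            return entry
    return "unknown"


def descriptor_worker(clips, memory):
    # Descriptor Thread: turn each incoming clip into text, then drop the clip.
    while True:
        clip = clips.get()
        if clip is None:  # sentinel: the stream has ended
            break
        memory.append(describe_clip(clip))


def run_demo():
    clips = queue.Queue()
    memory = TextualMemory()
    worker = threading.Thread(target=descriptor_worker, args=(clips, memory))
    worker.start()
    for i, content in enumerate(["user puts keys on table", "user opens fridge"]):
        clips.put({"index": i, "content": content})
    clips.put(None)
    worker.join()  # joined here only to make the demo deterministic
    return answer_query("Where are the keys?", memory.snapshot())


if __name__ == "__main__":
    print(run_demo())
```

In a live system the query would arrive while the descriptor thread is still running; the lock around the memory is what lets the QA side read a consistent snapshot at that moment.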

Paper Structure (17 sections, 7 equations, 3 figures, 5 tables)

INTRODUCTION
Related Works
Episodic Memory Question Answering
Streaming Multimodal Large Language Models
METHOD
Descriptor Thread
QA Thread
Prompt Design
EXPERIMENTAL SETTINGS
DEPLOYMENT SCENARIOS
STREAMING CONSTRAINT
MODELS AND SETUP
CONFIGURATION SELECTION UNDER REAL-TIME CONSTRAINTS
QUERY-TIME RESPONSIVENESS VIA TIME-TO-FIRST-TOKEN
MAIN RESULTS
...and 2 more sections

Figures (3)

Figure 1: High-level overview of the proposed Edge-based OEM-VQA system. The user wears smart glasses that continuously stream video to a local unit (GPU). The Descriptor Thread continuously processes the video into a textual memory $M$. When the user asks a question, the QA Thread uses the textual memory to produce the answer, and no raw video frames are stored.
Figure 2: Overview of the Streaming OEM-VQA Framework. The architecture is organized into two asynchronous threads. Descriptor Thread: handles the continuously streamed video clips ($c_k$) of $s$ seconds each. A Video LLM Descriptor generates a textual description ($d_k$) for each clip in execution time $T_{\text{des}}$, incrementally populating the semantic Memory $M$. QA Thread: activated upon a user query, this thread uses the stored textual Memory $M$ and a Reasoner model to deduce the Answer in time $T_{\text{ans}}$.
Figure 3: Overview of the adopted prompting strategy. The Descriptor prompt consists of four main components: 1) Task Description: instructs the model on the specific task to perform; 2) Detailed Instructions: provides specific guidelines, such as prioritizing actions or spatial positioning; 3) Question Template: primes the model with potential future questions; and 4) In-Context Learning Examples: provides a full clip description example to encourage adherence to output guidelines. Additionally, the 5) Reasoner Prompt is used at query time, providing the model with the question, candidate answers, and the accumulated memory history.
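The timing quantities in the Figure 2 caption suggest a simple feasibility check, sketched below in Python. The sketch assumes that clips are described one at a time and that keeping up with the stream requires $T_{\text{des}} \le s$; the paper's exact formulation of its streaming constraint is not shown in this excerpt, so this is an illustration only. The function name `description_lag` and the example numbers are hypothetical.

```python
def description_lag(num_clips, clip_seconds, t_des):
    """Seconds between each clip becoming available and its description being ready.

    Assumes clip k is fully captured at time (k + 1) * clip_seconds and that
    clips are described sequentially, each taking t_des seconds.
    """
    finish = 0.0
    lags = []
    for k in range(num_clips):
        available = (k + 1) * clip_seconds
        # The descriptor starts once the clip exists and the previous one is done.
        finish = max(finish, available) + t_des
        lags.append(finish - available)
    return lags


if __name__ == "__main__":
    print(description_lag(3, 2.0, 1.0))  # descriptor faster than the stream
    print(description_lag(3, 2.0, 3.0))  # descriptor slower than the stream
```

Under these assumptions the lag stays constant at `t_des` when the descriptor is at least as fast as the stream, and grows by `t_des - clip_seconds` per clip otherwise, so the memory falls progressively further behind the video.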
