Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

Giuseppe Lando, Rosario Forte, Antonino Furnari

TL;DR

Edge-based solutions for privacy-preserving episodic memory retrieval achieve competitive results, remaining promising even when compared to cloud-based solutions.

Abstract

We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants; we therefore investigate implementation on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.

Paper Structure (17 sections, 7 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: High-level overview of the proposed Edge-based OEM-VQA system. The user wears smart glasses that continuously stream video to a local unit (GPU). The Descriptor Thread continuously converts the video into a textual memory $M$. When the user asks a question, the QA Thread uses the textual memory to produce the answer, which is sent back to the user without storing raw video frames.
  • Figure 2: Overview of the Streaming OEM-VQA Framework. The architecture is organized into two asynchronous threads. Descriptor Thread: handles the continuously streamed video clips ($c_k$), each $s$ seconds long. A Video LLM Descriptor generates a textual description ($d_k$) for each clip in execution time $T_{\text{des}}$, incrementally populating the semantic Memory $M$. QA Thread: activated upon a user query, this thread uses the stored textual Memory $M$ and a Reasoner model to deduce the Answer in time $T_{\text{ans}}$.
  • Figure 3: Overview of the adopted prompting strategy. The Descriptor prompt consists of four main components: 1) Task Description: instructs the model on the specific task to perform; 2) Detailed Instructions: provides specific guidelines, such as prioritizing actions or spatial positioning; 3) Question Template: primes the model with potential future questions; and 4) In-Context Learning Examples: provides a full clip description example to encourage adherence to output guidelines. Additionally, the 5) Reasoner Prompt is used at query time, providing the model with the question, candidate answers, and the accumulated memory history.
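
The two-thread organization described in the abstract and in Figure 2 can be illustrated with a minimal Python sketch. This is not the authors' implementation: `TextualMemory`, `describe_clip`, `answer_question`, and `run_demo` are hypothetical names introduced here, the clips are plain strings rather than video, and both models are replaced by trivial stubs (the Reasoner stub simply counts keyword matches). The sketch only shows the data flow: clips $c_k$ become descriptions $d_k$ stored in $M$, and a query is answered from a snapshot of $M$ without keeping the raw clips.

```python
import queue
import threading


class TextualMemory:
    """Thread-safe store of clip descriptions d_k (the memory M)."""

    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    def append(self, k, description):
        with self._lock:
            self._entries.append((k, description))

    def snapshot(self):
        # Return a copy so the QA thread never reads a list being mutated.
        with self._lock:
            return list(self._entries)


def describe_clip(k, clip):
    """Hypothetical stand-in for the Video LLM Descriptor."""
    return f"clip {k}: {clip}"


def answer_question(question, candidates, memory):
    """Hypothetical stand-in for the Reasoner (ignores the question text).

    Picks the candidate mentioned most often in the textual memory.
    """
    text = " ".join(d for _, d in memory).lower()
    return max(candidates, key=lambda c: text.count(c.lower()))


def descriptor_thread(clips, memory):
    """Consume clips as they arrive; keep only their textual descriptions."""
    k = 0
    while True:
        clip = clips.get()
        try:
            if clip is None:  # sentinel: the stream has ended
                return
            memory.append(k, describe_clip(k, clip))
            k += 1
        finally:
            clips.task_done()


def qa_thread(question, candidates, memory, answers):
    """Answer one query from a snapshot of the memory."""
    answers.put(answer_question(question, candidates, memory.snapshot()))


def run_demo():
    clips, answers = queue.Queue(), queue.Queue()
    memory = TextualMemory()
    descriptor = threading.Thread(target=descriptor_thread, args=(clips, memory))
    descriptor.start()

    for clip in ["the user puts the keys on the table", "the user opens the fridge"]:
        clips.put(clip)
    clips.join()  # demo only: wait for both descriptions so the output is deterministic

    # The descriptor thread is still alive here, waiting for further clips.
    qa = threading.Thread(
        target=qa_thread,
        args=("Where did I put my keys?", ["table", "sofa", "drawer"], memory, answers),
    )
    qa.start()
    qa.join()

    clips.put(None)  # stop the descriptor thread
    descriptor.join()
    return answers.get()


if __name__ == "__main__":
    print(run_demo())  # prints "table"
```

In a real deployment the `clips.join()` call would be dropped, so a query would be answered from whatever descriptions have been written so far. The lock around the memory is what lets the two threads run independently.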