Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

Gengyuan Zhang, Mingcong Ding, Tong Liu, Yao Zhang, Volker Tresp

TL;DR

This work investigates memory as context for streaming video event understanding with multimodal large language models (MLLMs). It formalizes memory as long-term $\boldsymbol{\text{M}^{l}}$ and short-term $\boldsymbol{\text{M}^{s}}$ memories that are prepended to inputs, and demonstrates that memory improves understanding but can introduce confabulations when memories are predicted rather than ground-truth. To mitigate this, the authors introduce CAMEO, a confabulation-aware memory modification framework that uses semantic entropy $se$ to quantify narrational credibility and reweights attention on confabulated memories via $w(\boldsymbol{\hat{t}})=1/\exp(-\tau \cdot se(\boldsymbol{\hat{t}}))$, combined with uncertainty estimation and probing of confabulation-prone attention heads. Empirical evaluation on Ego4D with OPT-2.7B and Vicuna-7B shows that memory as context indeed benefits event understanding, but streaming confabulation can degrade performance, which CAMEO effectively mitigates, yielding stronger streaming performance. The work highlights practical memory-augmented streaming inference while addressing misinformation risk, with implications for real-world, temporally extended video understanding in multimodal systems.
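
As a rough illustration of the reweighting described above, the following Python sketch computes a weight for one predicted narration from its semantic entropy. It is not the authors' implementation: the function names `semantic_entropy` and `memory_weight`, the cluster labels, and the value of `tau` are all assumptions made for this example, and semantic entropy is approximated here as Shannon entropy over hypothetical meaning clusters of sampled narrations.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over
    semantic clusters. Each entry is the (hypothetical) cluster label
    of one sampled narration for the same event."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    probs = [c / total for c in counts.values()]
    return sum(p * math.log(1.0 / p) for p in probs)


def memory_weight(se, tau):
    """Weight w = 1 / exp(-tau * se), following the formula as written
    in the summary. The sign convention of tau is an assumption."""
    return 1.0 / math.exp(-tau * se)


# Hypothetical cluster assignments for four sampled narrations each.
consistent = [0, 0, 0, 0]   # all samples agree in meaning
mixed = [0, 1, 2, 3]        # every sample means something different

tau = -1.0  # assumed value, chosen so that higher entropy lowers the weight

w_consistent = memory_weight(semantic_entropy(consistent), tau)
w_mixed = memory_weight(semantic_entropy(mixed), tau)
print(w_consistent, w_mixed)
```

With the formula as written, the weight equals $\exp(\tau \cdot se)$, so it shrinks as entropy rises only when $\tau$ is negative. The summary does not state the sign convention, and the example picks `tau = -1.0` purely for illustration.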

Abstract

Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, in which a video is treated as a sequence of visual events, remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as contexts helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.

Paper Structure

This paper contains 42 sections, 4 equations, 9 figures, 2 tables, 2 algorithms.

Figures (9)

  • Figure 1: Knowing the memory of past events can help understand the current event. However, for streaming events in videos, we cannot access ground-truth narrations for previous events and this leads to confabulation.
  • Figure 2: Model Pipeline for memory as contexts for streaming events reasoning. We interleave the events and narrations from long-term and short-term memory as contextual inputs.
  • Figure 3: Performance improvements with CAMEO.
  • Figure 4: Evaluation of Vicuna-7B with different training ratios.
  • Figure 5: Evaluation of OPT-2.7B with different ratios.
  • ...and 4 more figures