Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

Gengyuan Zhang; Mingcong Ding; Tong Liu; Yao Zhang; Volker Tresp

TL;DR

This work investigates memory as context for streaming video event understanding with multimodal large language models (MLLMs). It formalizes memory as long-term $\boldsymbol{M}^{l}$ and short-term $\boldsymbol{M}^{s}$ memories that are prepended to inputs, and demonstrates that memory improves understanding but can introduce confabulations when memories are predicted rather than ground-truth. To mitigate this, the authors introduce CAMEO, a confabulation-aware memory modification framework that uses semantic entropy $se$ to quantify narrational credibility and reweights attention on confabulated memories via $w(\boldsymbol{\hat{t}}) = 1/\exp(-\tau \cdot se(\boldsymbol{\hat{t}}))$, combined with uncertainty estimation and probing of confabulation-prone attention heads. Empirical evaluation on Ego4D with OPT-2.7B and Vicuna-7B shows that memory as context indeed benefits event understanding, but streaming confabulation can degrade performance, which CAMEO effectively mitigates, yielding stronger streaming performance. The work highlights practical memory-augmented streaming inference while addressing misinformation risk, with implications for real-world, temporally extended video understanding in multimodal systems.
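As a rough illustration of the weighting term above, the following Python sketch computes a semantic entropy over sampled narrations and then evaluates $w = 1/\exp(-\tau \cdot se)$ exactly as the formula is written in this summary. It rests on several assumptions that are not taken from the paper: the sampled narrations are assumed to be already grouped into semantic-equivalence clusters (the `cluster_ids` input), the entropy is estimated from empirical cluster frequencies, and the function names and the default `tau=1.0` are hypothetical. How the resulting weight is applied to attention over memory tokens is not specified here, so the sketch stops at the scalar weight.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over semantic clusters.

    `cluster_ids` holds one label per sampled narration; narrations with the
    same meaning share a label. The clustering step itself is assumed to be
    done elsewhere and is not part of this sketch.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def memory_weight(se, tau=1.0):
    """Evaluate w = 1 / exp(-tau * se), transcribed from the summary above.

    `tau` is a temperature-like hyperparameter; its default here is an
    illustrative assumption, not a value reported by the authors.
    """
    return 1.0 / math.exp(-tau * se)


# Hypothetical example: four sampled narrations for one past event.
consistent = ["a", "a", "a", "a"]   # all samples agree in meaning
mixed = ["a", "b", "a", "c"]        # samples disagree in meaning

se_consistent = semantic_entropy(consistent)
se_mixed = semantic_entropy(mixed)

w_consistent = memory_weight(se_consistent)
w_mixed = memory_weight(se_mixed)
```

With these inputs, the fully consistent samples have zero entropy and a weight of 1, while the mixed samples have higher entropy and therefore a different weight; the gap between the two is controlled by `tau`.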

Abstract

Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, in which a video is treated as a sequence of visual events, remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as context helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.
