MMA: Multimodal Memory Agent

Yihao Lu, Wanru Cheng, Zeyu Zhang, Hao Tang

TL;DR

The paper addresses the reliability of memory-augmented long-horizon agents, focusing on two failure modes: retrieval traps and overconfident errors. It introduces MMA, a memory agent with a Confidence Module that assigns each retrieved memory a reliability score based on source credibility, temporal decay, and cross-memory consensus; the score is used to reweight evidence and to abstain when evidence is insufficient. It also presents MMA-Bench, a benchmark designed to stress-test belief dynamics with controlled reliability priors and cross-modal contradictions, which reveals phenomena such as the Visual Placebo Effect in RAG-based systems. Across FEVER, LoCoMo, and MMA-Bench, MMA achieves comparable accuracy with substantially improved stability and calibrated abstention, demonstrating the practical value of epistemic prudence in memory-augmented decision making.

Abstract

Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
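
The reliability-weighted retrieval described above can be illustrated with a minimal sketch. This is not the paper's formulation: the multiplicative combination, the half-life decay, the consensus range, the abstention threshold, and all names (`MemoryItem`, `reliability`, `answer_or_abstain`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    source_credibility: float  # assumed prior in [0, 1]
    age_days: float            # time since the memory was written
    consensus: float           # assumed in [-1, 1]; negative = contradicted by other memories


def reliability(item: MemoryItem, half_life_days: float = 30.0) -> float:
    """Combine the three signals into one score in [0, 1] (hypothetical formula)."""
    decay = 0.5 ** (item.age_days / half_life_days)  # exponential temporal decay
    consensus_factor = (1.0 + item.consensus) / 2.0  # map [-1, 1] onto [0, 1]
    return item.source_credibility * decay * consensus_factor


def answer_or_abstain(items, similarities, threshold: float = 0.25):
    """Reweight similarity by reliability; return None (abstain) if support is too weak."""
    if not items:
        return None
    scored = [(sim * reliability(it), it) for it, sim in zip(items, similarities)]
    best_score, best_item = max(scored, key=lambda pair: pair[0])
    return best_item if best_score >= threshold else None


# A recent, credible, well-supported memory versus a stale, contradicted one
# that happens to have higher embedding similarity to the query.
memory_a = MemoryItem("credible and recent", 0.9, age_days=2, consensus=0.8)
memory_b = MemoryItem("stale and contradicted", 0.3, age_days=200, consensus=-0.5)

chosen = answer_or_abstain([memory_a, memory_b], similarities=[0.7, 0.95])
only_stale = answer_or_abstain([memory_b], similarities=[0.95])
```

Under these assumed values, the reweighted score prefers `memory_a` even though `memory_b` has higher raw similarity, and the agent abstains when only `memory_b` is available.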

Paper Structure (38 sections, 8 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Retrieval Trap Case Study. MIRIX fails by retrieving the high-similarity but irrelevant Memory B. MMA correctly identifies the credible Memory A using multi-dimensional reliability signals.
  • Figure 2: MMA Framework. The Confidence Module reweights retrieval via source reliability, temporal decay, and network consensus to modulate reasoning and abstention.
  • Figure 3: MMA-Bench evaluation framework. The benchmark integrates cross-modal consistency analysis, risk-aware betting, and a $2\times2$ logic matrix for trust conflicts. Performance is assessed through fundamental QA and a 3-step belief probe.
  • Figure 4: Detailed Dynamics Analysis. (a) Step-wise belief revision; (b) Risk-adjusted scores highlighting visual noise sensitivity; (c) Gap analysis between retrieval accuracy and calibration.
  • Figure 5: Evolutionary Logic Spectrum. Tracing performance from Foundation Model to MMA. MMA restores agency (Activation) and buffers inherited visual bias (Placebo Effect).
  • ...and 6 more figures