Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li

TL;DR

ViLoMem introduces a dual-stream multimodal semantic memory that separately models visual distraction patterns and logical hallucination errors, coordinating retrieval through question-aware attention and precise-formation memory updates within a grow-and-refine cycle. The framework supports progressive, lifelong learning by preserving stable, generalizable strategies while limiting forgetting, and it is evaluated on six multimodal benchmarks. Ablations confirm that both streams are essential and complementary, and cross-model and cross-domain analyses reveal nuanced transfer dynamics. Across diverse model scales, ViLoMem yields consistent pass@1 improvements, particularly in visually grounded mathematical reasoning, and shows potential as a lightweight mechanism for knowledge distillation and continual learning in multimodal agents.

Abstract

MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo, solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge, preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction-hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
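
As an illustration of the grow-and-refine principle, the following minimal Python sketch routes each attributed error to one of two memory streams, then either merges it into a sufficiently similar existing entry or creates a new one. The class names, the cosine-similarity threshold of 0.9, the running-average merge rule, and the toy two-dimensional embeddings are all assumptions made for this example; none of them are taken from the paper.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MemoryEntry:
    guideline: str   # compact schema text describing an error pattern
    embedding: list  # vector for the guideline (the embedding model is out of scope)
    hits: int = 1    # number of experiences folded into this entry


@dataclass
class MemoryStream:
    threshold: float = 0.9  # hypothetical merge threshold, not from the paper
    entries: list = field(default_factory=list)

    def update(self, guideline, embedding):
        """Refine a similar entry if one exists; otherwise grow the stream."""
        best = max(self.entries,
                   key=lambda e: cosine(e.embedding, embedding),
                   default=None)
        if best is not None and cosine(best.embedding, embedding) >= self.threshold:
            # Refine: running average of embeddings (an assumed merge rule).
            n = best.hits
            best.embedding = [(n * x + y) / (n + 1)
                              for x, y in zip(best.embedding, embedding)]
            best.hits += 1
            return "merge"
        # Grow: no close match, so store a new entry.
        self.entries.append(MemoryEntry(guideline, list(embedding)))
        return "create"


@dataclass
class DualStreamMemory:
    visual: MemoryStream = field(default_factory=MemoryStream)
    logical: MemoryStream = field(default_factory=MemoryStream)

    def record(self, error_type, guideline, embedding):
        """Route an attributed error to the visual or logical stream."""
        if error_type not in ("visual", "logical"):
            raise ValueError(f"unknown error type: {error_type!r}")
        stream = self.visual if error_type == "visual" else self.logical
        return stream.update(guideline, embedding)


# Toy usage with hand-written embeddings.
memory = DualStreamMemory()
first = memory.record("visual", "Read the axis scale before comparing bars.", [1.0, 0.0])
second = memory.record("visual", "Check the axis scale first.", [0.99, 0.05])
third = memory.record("logical", "Confirm a right angle before using Pythagoras.", [0.0, 1.0])
```

In this toy run the second visual guideline is close enough to the first to be merged, while the logical guideline opens a new entry in its own stream. Keeping the two streams in separate stores means a visual pattern is never merged into a logical one, which is one simple way to keep the two error types apart.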

Paper Structure

This paper contains 27 sections, 11 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Multimodal Semantic Memory Enables Progressive Learning. When solving multimodal problems, early attempts may contain both logical and visual errors. Through feedback, the model refines its logical memory for question-appropriate theorem application and its visual memory to avoid perceptual traps, improving by integrating "where to look" with "how to reason".
  • Figure 2: Overview of the ViLoMem framework. (a) Memory Cycle: A closed-loop learning mechanism where both logical and visual memories are retrieved and utilized by the solver. Retrieval is conditioned on the textual question and its paired image. The solver then performs reasoning steps (actions), which are evaluated by the verifier to filter redundant or invalid trajectories. The remaining trajectories are used to update both memory streams according to their respective types. (b) Memory Generation: An error-attribution framework that employs an LLM for logical analysis and an MLLM for visual analysis, producing structured memory schemas through similarity-based merge and create operations. (c) Memory Retrieval: Specialized dual-stream retrieval mechanism. Visual memories undergo a two-stage process involving image-embedding retrieval followed by question-specific retrieval, since visual information must be conditioned on both image content and the textual query. Logical memories are retrieved through problem analysis and text-embedding similarity.
  • Figure 3: Visual memory generation and retrieval examples. Each case shows the original error, the extracted visual pattern, and successful retrieval in analogous scenarios.
  • Figure 4: Analysis of dual-stream memory usage patterns across six benchmarks. (a) Memory generation and retrieval statistics show that visual errors dominate generation (59% to 93%), while retrieval operations significantly exceed generation events. (b) Cross-task dependency analysis reveals balanced utilization of both memory streams during retrieval across diverse tasks and models.
  • Figure 5: Showcase of representative cases demonstrating ViLoMem's memory generation and retrieval process across different types of multimodal reasoning tasks.
  • ...and 5 more figures
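
The two-stage visual retrieval described in the Figure 2(c) caption can be sketched roughly as follows: candidates are first shortlisted by image-embedding similarity, and the shortlist is then re-ranked by similarity to the question text. This is a minimal sketch under stated assumptions. The `VisualMemory` record, the function names, the cut-offs `k_image` and `k_final`, the use of cosine similarity in both stages, and the hand-written toy embeddings are all hypothetical and are not taken from the paper.

```python
import math
from typing import NamedTuple, Sequence


class VisualMemory(NamedTuple):
    guideline: str              # stored visual pattern, in schema form
    image_emb: Sequence[float]  # embedding of the image that triggered the error
    text_emb: Sequence[float]   # embedding of the associated question or guideline text


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(items, score, k):
    """Return the k highest-scoring items, best first."""
    return sorted(items, key=score, reverse=True)[:k]


def retrieve_visual(memories, query_image_emb, query_text_emb, k_image=2, k_final=1):
    # Stage 1: shortlist memories whose source image resembles the query image.
    shortlist = top_k(memories, lambda m: cosine(m.image_emb, query_image_emb), k_image)
    # Stage 2: re-rank the shortlist by relevance to the question text.
    return top_k(shortlist, lambda m: cosine(m.text_emb, query_text_emb), k_final)


# Toy memory bank with hand-written embeddings.
memories = [
    VisualMemory("Bar chart: read the axis scale.", [1.0, 0.0, 0.0], [1.0, 0.0]),
    VisualMemory("Bar chart: match legend colors.", [0.9, 0.1, 0.0], [0.0, 1.0]),
    VisualMemory("Geometry: do not trust drawn angles.", [0.0, 0.0, 1.0], [0.0, 1.0]),
]

result = retrieve_visual(memories,
                         query_image_emb=[1.0, 0.05, 0.0],
                         query_text_emb=[0.1, 0.9])
```

In this toy example the geometry memory matches the question text as well as the legend memory does, but stage 1 drops it because its image embedding is unrelated to the query image. That is the motivation for conditioning visual retrieval on both the image and the question. Logical memories, which the caption says are retrieved by text-embedding similarity, would need only a single `top_k` call over text embeddings.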