Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory

Jiaqi Liu, Zipeng Ling, Shi Qiu, Yanqing Liu, Siwei Han, Peng Xia, Haoqin Tu, Zeyu Zheng, Cihang Xie, Charles Fleming, Mingyu Ding, Huaxiu Yao

Abstract

AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines, a space too large and interconnected for manual search or traditional AutoML to explore effectively. We deploy an autonomous research pipeline to discover Omni-SimpleMem, a unified multimodal memory framework for lifelong AI agents. Starting from a naïve baseline (F1=0.117 on LoCoMo), the pipeline autonomously executes ${\sim}50$ experiments across two benchmarks, diagnosing failure modes, proposing architectural modifications, and repairing data-pipeline bugs, all without human intervention in the inner loop. The resulting system achieves state-of-the-art results on both benchmarks, improving F1 by +411% on LoCoMo (0.117$\to$0.598) and by +214% on Mem-Gallery (0.254$\to$0.797) relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML. We provide a taxonomy of six discovery types and identify four properties that make multimodal memory particularly well suited to autoresearch, offering guidance for applying autonomous research pipelines to other AI system domains. Code is available at https://github.com/aiming-lab/SimpleMem.

Figures (4)

  • Figure 1: Overview of the Discovery Process of Omni-SimpleMem. (a) Discovered architecture: multimodal inputs are filtered for novelty, compressed into MAUs, and retrieved via hybrid dense-sparse-graph search with pyramid expansion. (b) Autonomous optimization trajectory on Mem-Gallery: 39 experiments improve F1 from 0.254 to 0.793 (+214%).
  • Figure 2: Omni-SimpleMem architecture overview. Left: Selective ingestion filters multimodal inputs (text, image, audio, video) via modality-specific novelty detectors and creates MAUs with LLM-generated summaries and embeddings. Center: MAUs are stored in hot storage (summaries, embeddings, metadata) and cold storage (raw content), with entity extraction building a knowledge graph with typed entities and relations. Right: Retrieval combines dense (FAISS), sparse (BM25), and graph ($h$-hop) search via set-union merging, then progressively expands results through a pyramid mechanism (summaries $\to$ full text $\to$ raw content) under a token budget $B$. (See the sketch after this list.)
  • Figure 3: Optimization trajectories on LoCoMo (top, 9 iterations) and Mem-Gallery (bottom, 39 experiments across 7 phases). Solid lines denote accepted iterations; failed/reverted experiments are marked with $\times$. The dashed red line indicates the previous SOTA. Key discoveries at each stage are annotated.
  • Figure 4: Throughput vs. F1. Omni-SimpleMem with 8 workers achieves 3.5$\times$ higher throughput.
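
To make the retrieval stage in Figure 2 concrete, below is a minimal Python sketch of set-union merging over dense, sparse, and graph candidates followed by pyramid expansion under a token budget $B$. It is illustrative only: the `MAU` fields, the stubbed retriever callables, and the whitespace token estimate are assumptions for this sketch, not the paper's actual interfaces.

```python
# Illustrative sketch (not the paper's code) of the Figure 2 retrieval stage:
# set-union merging of dense/sparse/graph candidates, then pyramid expansion
# (summary -> full text -> raw content) under a token budget B.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MAU:
    """Memory Atomic Unit: fields assumed for this sketch."""
    uid: str
    summary: str       # hot storage
    full_text: str     # hot storage
    raw_content: str   # stand-in for a cold-storage payload reference

def n_tokens(text: str) -> int:
    # Crude whitespace estimate; a real system would use the model tokenizer.
    return len(text.split())

# A retriever maps (query, top_k) to a ranked list of MAU ids.
Retriever = Callable[[str, int], List[str]]

def retrieve(query: str, dense: Retriever, sparse: Retriever, graph: Retriever,
             store: Dict[str, MAU], budget_b: int = 512, top_k: int = 10) -> List[str]:
    # 1) Set-union merge: keep each candidate once, preserving first-seen rank.
    seen, candidates = set(), []
    for retriever in (dense, sparse, graph):
        for uid in retriever(query, top_k):
            if uid not in seen:
                seen.add(uid)
                candidates.append(store[uid])

    # 2) Start every candidate at its summary (base of the pyramid).
    pieces = [mau.summary for mau in candidates]
    spent = sum(n_tokens(p) for p in pieces)

    # Drop the lowest-ranked pieces if even the summaries overflow the budget.
    while pieces and spent > budget_b:
        spent -= n_tokens(pieces.pop())

    # 3) Pyramid expansion: promote higher-ranked candidates to richer content
    #    (full text, then raw content) while the token budget allows.
    for i in range(len(pieces)):
        for richer in (candidates[i].full_text, candidates[i].raw_content):
            delta = n_tokens(richer) - n_tokens(pieces[i])
            if spent + delta <= budget_b:
                spent += delta
                pieces[i] = richer
            else:
                break
    return pieces
```

In a real deployment, per the caption, `dense` would be backed by a FAISS index over MAU embeddings, `sparse` by BM25 scores, and `graph` by an $h$-hop walk over the entity knowledge graph; the sketch stubs them as callables to keep the merge-and-expand logic in focus.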