LifeBench: A Benchmark for Long-Horizon Multi-Source Memory

Zihao Cheng; Weixin Wang; Yu Zhao; Ziyang Ren; Jiaxuan Chen; Ruiyang Xu; Shuai Huang; Yang Chen; Guowei Li; Mengshi Wang; Yi Xie; Ren Zhu; Zeren Jiang; Keda Lu; Yihong Li; Xiaoliang Wang; Liwei Liu; Cam-Tu Nguyen


TL;DR

This work introduces LifeBench, a benchmark built on densely connected, long-horizon event simulation. Drawing on cognitive science, it structures events according to their partonomic hierarchy, enabling efficient parallel generation while maintaining global coherence.

Abstract

Long-term memory is fundamental for personalized agents that accumulate knowledge, reason over user experiences, and adapt over time. However, existing memory benchmarks primarily target declarative memory, specifically semantic and episodic types, where all information is explicitly presented in dialogues. In contrast, real-world actions are also governed by non-declarative memory, including habitual and procedural types, which must be inferred from diverse digital traces. To bridge this gap, we introduce LifeBench, which features densely connected, long-horizon event simulation. It pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning across diverse and temporally extended contexts. Building such a benchmark presents two key challenges: ensuring data quality and scalability. We maintain data quality by employing real-world priors, including anonymized social surveys, map APIs, and holiday-integrated calendars, thus enforcing fidelity, diversity, and behavioral rationality within the dataset. For scalability, we draw inspiration from cognitive science and structure events according to their partonomic hierarchy, enabling efficient parallel generation while maintaining global coherence. Results show that state-of-the-art memory systems reach just 55.2% accuracy, highlighting the inherent difficulty of long-horizon retrieval and multi-source integration in our benchmark. The dataset and data synthesis code are available at https://github.com/1754955896/LifeBench.
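The link between a partonomic (part-whole) event hierarchy and parallel generation can be illustrated with a small sketch. This is not the LifeBench pipeline: the `Event` class, `split_evenly`, `build_hierarchy`, and `leaves` are hypothetical names, and `split_evenly` is a deterministic stand-in for whatever generator (for example, an LLM call) would produce sub-events. The only point shown is that once a parent event fixes its time span, its children depend on that parent alone, so every node on a level can be expanded concurrently while the leaves still tile the overall timeline.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Event:
    """A node in a part-whole event hierarchy (hypothetical structure)."""
    name: str
    start_day: int  # inclusive
    end_day: int    # inclusive
    children: list[Event] = field(default_factory=list)


def split_evenly(parent: Event, parts: int) -> list[Event]:
    """Stand-in for a generator call: cut the parent's span into
    contiguous sub-events that never leave the parent's bounds."""
    total = parent.end_day - parent.start_day + 1
    parts = max(1, min(parts, total))
    bounds = [parent.start_day + (i * total) // parts for i in range(parts + 1)]
    return [
        Event(f"{parent.name}/{i + 1}", bounds[i], bounds[i + 1] - 1)
        for i in range(parts)
    ]


def build_hierarchy(root: Event, depth: int, fanout: int = 3, workers: int = 4) -> Event:
    """Expand the tree level by level; all nodes on a level run in parallel."""
    frontier = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(depth):
            # Each node needs only its own span, so the level is independent work.
            results = list(pool.map(lambda e: split_evenly(e, fanout), frontier))
            for parent, kids in zip(frontier, results):
                parent.children = kids
            frontier = [kid for kids in results for kid in kids]
    return root


def leaves(event: Event) -> list[Event]:
    """Return the finest-grained events in timeline order."""
    if not event.children:
        return [event]
    return [leaf for child in event.children for leaf in leaves(child)]


if __name__ == "__main__":
    year = build_hierarchy(Event("year", 1, 360), depth=2, fanout=3)
    for leaf in leaves(year):
        print(leaf.name, leaf.start_day, leaf.end_day)
```

Expanding breadth-first, rather than recursing inside worker threads, keeps a bounded thread pool from blocking on its own pending tasks. In this toy version global coherence reduces to span containment; a real generator would need to pass richer parent context down to each child.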


Paper Structure (79 sections, 6 equations, 9 figures, 5 tables, 1 algorithm)


Introduction
Related Work
LifeBench: Memory Simulation Framework
Design Principles
Synthesis Pipeline Architecture
Persona Synthesis
Basic information generation.
Social network.
Personalized urban landscape.
Hierarchical Outline Planning
Plot Generation.
Thematic Event Generation and Decomposition.
Atomic Event Optimization.
Dual-Agent Daily Activity Simulation
Subjective Agent:
...and 64 more sections

Figures (9)

Figure 1: LifeBench captures rich human activities, which are driven not only by declarative memory (semantic or episodic) but also by non-declarative memory (habits, skills, etc.). User actions and habits must be inferred from fragmented and diverse digital traces (lower part).
Figure 2: The LifeBench data synthesis pipeline architecture and key algorithmic components.
Figure 3: Samples of LifeBench question types.
Figure 4: Statistics of Events and Phone Data.
Figure 5: Distribution of Query Time.
...and 4 more figures
