LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Feiyu Duan, Xuanjing Huang, Zhongyu Wei

Abstract

The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions: they fail to capture the complexity of external contexts and of users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition with the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories, and that simulates intention-driven interactive user behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intentions and in modeling long-term user preferences.

Paper Structure (55 sections, 3 equations, 36 figures, 16 tables, 1 algorithm)

Figures (36)

  • Figure 1: Illustration of personal AI assistance grounded in long-horizon spatiotemporal context. User behaviors evolve with the external environment while reflecting stable personal traits. Effective responses require models to adapt their strategies to the current context while leveraging interaction history to infer personal states.
  • Figure 2: Overview of the LifeSim framework. For each target user, the user profile consists of demographic attributes, personality traits, and long-term preferences, which together contribute to the long-term belief state. The BDI-based cognitive engine and the event engine jointly generate user intentions by integrating subjective belief states with physical environments. The user behavior engine then produces conversations by modeling memory perception, emotion inference, and action selection.
  • Figure 3: Long-horizon intention completion performance across different assistant models. The heatmaps report intention completion (I.C.) scores with respect to conversation length.
  • Figure 4: Performance of user preference recovery within life event sequences. The dashed line shows the linear regression fit.
  • Figure 5: Relative model performance across different intention types. Abbreviations: I.R. = Intent Recognition, I.C. = Intent Completion.
  • ...and 31 more figures