SimUSER: Simulating User Behavior with Large Language Models for Recommender System Evaluation
Nicolas Bougie, Narimasa Watanabe
TL;DR
The paper tackles the persistent gap between offline metrics and real user behavior in recommender systems (RS) by introducing SimUSER, a scalable, cost-effective approach that uses LLM-based agents as believable human proxies. It presents a two-phase framework: persona matching based on self-consistency, followed by persona-driven interaction with a retrieval-augmented RS, supported by memory and perception modules. Key contributions include a graph-based memory, PathSim-based retrieval, multimodal perception via thumbnails, and multi-round causal action refinement. Experiments on MovieLens, AmazonBook, and Steam show closer alignment with human behavior and a stronger correlation between offline and online metrics. The approach offers a practical, extensible way to bridge offline evaluation and real-world engagement, supporting RS development and parameter tuning without extensive online testing.
Abstract
Recommender systems play a central role in numerous real-life applications, yet evaluating their performance remains a significant challenge due to the gap between offline metrics and online behaviors. Given the scarcity and limitations (e.g., privacy issues) of real user data, we introduce SimUSER, an agent framework whose agents serve as believable and cost-effective human proxies. SimUSER first identifies self-consistent personas from historical data, enriching user profiles with unique backgrounds and personalities. Then, simulated users equipped with persona, memory, perception, and brain modules interact with the recommender system. SimUSER exhibits closer alignment with genuine humans than prior work, at both micro and macro levels. Additionally, we conduct experiments to explore the effects of thumbnails on click rates, the exposure effect, and the impact of reviews on user engagement. Finally, we refine recommender system parameters based on offline A/B test results, resulting in improved user engagement in the real world.
