Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs
Jiani Huang, Shijie Wang, Liang-bo Ning, Wenqi Fan, Shuaiqiang Wang, Dawei Yin, Qing Li
TL;DR
RecBench+ is a benchmark for evaluating LLM-based personalized recommendation assistants under realistic, interactive query settings. It defines two query paradigms (Condition-based and User Profile-based) and constructs ~30k high-quality textual queries grounded in MovieLens-1M and Amazon-Book, with an Item KG and ground-truth generation via LLM simulations. The paper evaluates seven state-of-the-art LLMs, revealing that while LLMs handle explicit conditions well, they struggle with reasoning-intensive queries, implicit conditions, and misinformation, and that performance also depends on user history and demographics. The work provides actionable insights for designing robust, context-aware recommendation assistants and releases RecBench+ for future research.
Abstract
Recommender systems (RecSys) are widely used across various modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, the advancement of large language models (LLMs) has revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits comprehensive assessment of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new benchmark dataset designed to assess LLMs' ability to handle intricate user recommendation needs in the era of LLMs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and uncovered the following findings: 1) LLMs demonstrate preliminary abilities to act as recommendation assistants; 2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at https://github.com/jiani-huang/RecBench.git.
