Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs

Jiani Huang, Shijie Wang, Liang-bo Ning, Wenqi Fan, Shuaiqiang Wang, Dawei Yin, Qing Li

TL;DR

RecBench+ is a benchmark for evaluating LLM-based personalized recommendation assistants under realistic, interactive query settings. It defines two query paradigms, Condition-based and User Profile-based, and constructs ~30k high-quality textual queries grounded in MovieLens-1M and Amazon-Book, using an Item KG and LLM simulations to generate queries and their ground truth. The paper evaluates seven state-of-the-art LLMs and finds that LLMs handle explicit conditions well but struggle with reasoning-intensive, implicit, and misinformation cases, and that performance also depends on user history and demographics. The work offers insights for designing robust, context-aware recommendation assistants and releases RecBench+ for future research.

Abstract

Recommender systems (RecSys) are widely used across modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, the advancement of large language models (LLMs) has revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits comprehensive assessment of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new dataset benchmark designed to assess LLMs' ability to handle intricate user recommendation needs in the era of LLMs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and uncovered the following findings: 1) LLMs demonstrate preliminary abilities to act as recommendation assistants; 2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at https://github.com/jiani-huang/RecBench.git.

Paper Structure

This paper contains 32 sections, 3 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Illustration of Different Recommendation Paradigms. (a) Traditional recommendation generates suggestions in simple, fixed scenarios by learning user/item representations, such as Top-K or similar-item recommendation. (b) LLM-based recommendation relies on fixed and simple query templates, restricting users to predefined formats. (c) In contrast, our benchmark evaluates the LLM’s ability to handle complex, flexible user queries as a recommendation assistant.
  • Figure 2: An overview of the pipeline for constructing Condition-based Queries. The process starts by constructing an Item Knowledge Graph (KG) that links items to their attributes. The second step uses the KG to identify item sets in the user’s history that share the same relation (e.g., movies directed by the same person). These shared relations serve as the basis for constructing different types of conditions for the query. Finally, we use LLMs to simulate users and generate queries from the conditions, and the identified item set serves as the ground truth.
  • Figure 3: An overview of the pipeline for constructing User Profile-based Queries. (a) For Interest-based Queries (top), we identify common interests by extracting collaborative item sets and preceding item sequences from user interactions, and queries are generated from the inferred interests. (b) For Demographics-based Queries (bottom), users are grouped by demographics, and queries are generated using the demographics and the popular items within each group.
  • Figure 4: Performance on Condition-based Queries with different numbers of conditions.
  • Figure 5: The effect of incorporating user-item history.
  • ...and 4 more figures
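
The second step of the Figure 2 pipeline (finding item sets in a user's history that share a KG relation) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `ITEM_KG` dictionary, the relation names, the `min_size` threshold, and the function `shared_relation_item_sets` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy Item KG: item -> list of (relation, value) pairs.
# The real benchmark builds its KG from MovieLens-1M and Amazon-Book items.
ITEM_KG = {
    "Jaws": [("directed_by", "Steven Spielberg"), ("genre", "Thriller")],
    "E.T.": [("directed_by", "Steven Spielberg"), ("genre", "Sci-Fi")],
    "Alien": [("directed_by", "Ridley Scott"), ("genre", "Sci-Fi")],
    "Heat": [("directed_by", "Michael Mann"), ("genre", "Crime")],
}


def shared_relation_item_sets(history, item_kg, min_size=2):
    """Group items in a user's history by the (relation, value) pairs they share."""
    groups = defaultdict(list)
    for item in history:
        for relation, value in item_kg.get(item, []):
            groups[(relation, value)].append(item)
    # Keep only relations shared by at least `min_size` items (assumed threshold).
    return {cond: items for cond, items in groups.items() if len(items) >= min_size}


history = ["Jaws", "E.T.", "Alien", "Heat"]
candidates = shared_relation_item_sets(history, ITEM_KG)
for (relation, value), items in candidates.items():
    # Each shared relation is a candidate condition; its item set is the ground truth.
    print(f"condition: {relation} = {value} -> ground truth: {items}")
```

In this sketch each shared `(relation, value)` pair stands in for a candidate condition, and the items that satisfy it play the role of the ground truth that an LLM-simulated user query would be paired with.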