Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

Qijiong Liu, Jieming Zhu, Lu Fan, Kun Wang, Hengchang Hu, Wei Guo, Yong Liu, Xiao-Ming Wu

TL;DR

This work presents RecBench, a comprehensive benchmark for evaluating LLM-based recommender systems against conventional deep CTR and sequential models. It systematically varies item representations (unique ID, text, semantic embedding, semantic identifier) and assesses two core tasks, CTR (pair-wise) and SeqRec (list-wise), across five diverse datasets with zero-shot and fine-tuned LLMs. Key findings show that LLMs can surpass traditional baselines in accuracy, especially with semantic-enabled representations, but suffer severe inference latency that challenges real-time deployment. The study advocates hybrid approaches (LLM-for-RS) and targeted acceleration techniques, and commits to releasing code, data, and configurations to promote reproducibility and further progress in the field.

Abstract

In recent years, integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for real-time recommendation environments. We aim for our findings to inspire future research, including recommendation-specific model acceleration methods. We will release our code, data, configurations, and platform to enable other researchers to reproduce and build upon our experimental results.
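The two headline metrics can be made concrete with a minimal sketch. It assumes binary click labels for AUC and a single ground-truth next item per user for NDCG@10, which is a common SeqRec setup but is not stated in the abstract. The function names are hypothetical. The abstract also does not say whether the reported gains are relative or absolute; `relative_improvement` shows the relative reading only as an assumption.

```python
import math

def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs where the
    positive item gets the higher score; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k with exactly one relevant item (assumption): the ideal DCG
    is 1, so the score is 1 / log2(rank + 1) if the target is in the
    top k and 0 otherwise."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def relative_improvement(new, base):
    """Relative gain of `new` over `base`, in percent (assumed reading)."""
    return (new - base) / base * 100.0
```

Under the relative reading, an NDCG@10 that rises from 0.10 to 0.27 would count as a gain of about 170%; these two values are illustrative and are not results from the paper.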

Paper Structure

This paper contains 28 sections, 10 equations, 2 figures, 13 tables.

Figures (2)

  • Figure 1: Illustration of DLRM and LLM recommenders in two scenarios. Each placeholder can be filled with various item representations, including unique identifier, text, semantic embedding, or semantic identifier.
  • Figure 2: (Left) Various forms of item representations. (Right) Groups chosen for benchmarking and their representative methods. N/A: little or no related work exists, so the group is not evaluated.
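
The four item representation forms named in the captions can be pictured as alternative fields on one item record. The sketch below is illustrative only: the class name, field names, and types are hypothetical and do not come from the RecBench code. In particular, treating a semantic identifier as a short tuple of discrete codes is an assumption based on how such identifiers are commonly built.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

@dataclass
class ItemRepresentation:
    """One item, described in any of the four forms (all names hypothetical)."""
    unique_id: Optional[int] = None                       # index into an ID embedding table
    text: Optional[str] = None                            # title or description fed as tokens
    semantic_embedding: Optional[Sequence[float]] = None  # dense vector from a text encoder
    semantic_id: Optional[Tuple[int, ...]] = None         # assumed: tuple of discrete codes

    def available_forms(self) -> List[str]:
        """Names of the forms that are filled in for this item."""
        names = ("unique_id", "text", "semantic_embedding", "semantic_id")
        return [n for n in names if getattr(self, n) is not None]

# Example item with only an ID and a text description.
item = ItemRepresentation(unique_id=42, text="Blue denim jacket")
print(item.available_forms())
```

A recommender that consumes such a record would pick whichever form its input layer supports, which matches the idea of a placeholder that can be filled in different ways.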