Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis

Lanling Xu, Junjie Zhang, Bingqian Li, Jinpeng Wang, Sheng Chen, Wayne Xin Zhao, Ji-Rong Wen

TL;DR

This work tackles information overload in recommendation by systematically studying Large Language Models (LLMs) as recommender systems through a general LLM-RS framework. It formalizes input into natural language prompts and analyzes two main factors: the choice of foundation LLMs (public availability, tuning, architecture, scale, and context length) and the structure of prompts (task descriptions, user interest modeling, candidate item construction, and prompting strategies). Extensive experiments on MovieLens-1M and Amazon Books examine zero-shot ranking and CTR-style fine-tuning, yielding insights such as the superiority of closed-source LLMs in cold-start scenarios, the gains from instruction tuning and LoRA-based fine-tuning, and the critical role of prompt design in eliciting effective recommendations. The findings offer practical guidance for deploying LLMs in recommendation, including grounding candidate items, leveraging retrieval-augmented strategies for user interest, and balancing efficiency with performance in real-world systems.

Abstract

Recently, Large Language Models (LLMs) such as ChatGPT have shown remarkable abilities in solving general tasks, demonstrating their potential for applications in recommender systems. To assess how effectively LLMs can be used in recommendation tasks, our study primarily focuses on employing LLMs as recommender systems through prompt engineering. We propose a general framework for utilizing LLMs in recommendation tasks, focusing on the capabilities of LLMs as recommenders. To conduct our analysis, we formalize the input of LLMs for recommendation into natural language prompts with two key aspects, and explain how our framework can be generalized to various recommendation scenarios. As for the use of LLMs as recommenders, we analyze the impact of public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation results based on the classification of LLMs. As for prompt engineering, we further analyze the impact of four important components of prompts, i.e., task descriptions, user interest modeling, candidate items construction, and prompting strategies. In each section, we first define and categorize concepts in line with the existing literature. Then, we propose research questions followed by detailed experiments on two public datasets, in order to systematically analyze the impact of different factors on performance. Based on our empirical analysis, we finally summarize promising directions to shed light on future research.
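
To make the idea of formalizing recommendation input as a natural language prompt more concrete, the following is a minimal Python sketch that assembles a prompt from the four components named above. It is an illustration under stated assumptions, not the authors' implementation: the `PromptParts` container, the `build_prompt` function, the `max_history` parameter, the section wording, and the placeholder item titles are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptParts:
    """Hypothetical container for the four prompt components (not from the paper)."""
    task_description: str     # what the LLM is asked to do
    user_history: List[str]   # interacted item titles, oldest first (user interest modeling)
    candidates: List[str]     # candidate item titles to rank (candidate items construction)
    strategy_hint: str = ""   # optional extra instruction (prompting strategy)


def build_prompt(parts: PromptParts, max_history: int = 5) -> str:
    """Assemble one natural language prompt from the four components."""
    if max_history < 1:
        raise ValueError("max_history must be at least 1")
    # Keep only the most recent interactions to represent user interest.
    recent = parts.user_history[-max_history:]
    history = "\n".join(f"{i}. {title}" for i, title in enumerate(recent, start=1))
    cands = "\n".join(f"[{i}] {title}" for i, title in enumerate(parts.candidates, start=1))
    sections = [
        parts.task_description,
        "Recently interacted items:\n" + history,
        "Candidate items:\n" + cands,
    ]
    # The strategy hint is appended only when one is supplied.
    if parts.strategy_hint:
        sections.append(parts.strategy_hint)
    return "\n\n".join(sections)


# Placeholder data for illustration only.
example = PromptParts(
    task_description="Rank the candidate movies by how likely the user is to watch them next.",
    user_history=["Movie A", "Movie B", "Movie C"],
    candidates=["Movie X", "Movie Y"],
    strategy_hint="Answer with the candidate numbers only, best first.",
)
prompt = build_prompt(example, max_history=2)
print(prompt)
```

Keeping each component as a separate field makes it straightforward to vary one factor (for example, the history length or the candidate set) while holding the others fixed, which is the kind of controlled comparison the abstract describes.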

Paper Structure (50 sections, 11 figures, 14 tables)

Figures (11)

  • Figure 1: The overall framework of our proposed LLM-RS. In our framework, LLMs are leveraged as recommender systems in four ways: prompting without tuning, full-model fine-tuning, parameter-efficient fine-tuning, and instruction tuning. Prompt engineering consists of four components: task description, user interest modeling, candidate items construction, and prompting strategies. LLMs act as recommender systems through task-specific prompts, while their responses and feedback are used to optimize the prompts.
  • Figure 2: Basic prompts for experiments on the MovieLens-1M dataset in LLM-RS. We explore two application scenarios: (1) prompting LLMs without tuning, and (2) fine-tuning LLMs as recommender systems. The experiments focus on sequential re-ranking and Click-Through Rate (CTR) prediction tasks, respectively.
  • Figure 3: The recommendation performance of LLMs w.r.t. the number of historical items. As in Figure 2, we use recently interacted items to model user interest in the zero-shot ranking task. To compare the impact of history length on LLMs, we illustrate the ranking performance of the traditional sequential recommender SASRec (Kang and McAuley, 2018) and three closed-source LLMs with different numbers of historical items.
  • Figure 4: The recommendation performance and inference time of LLMs w.r.t. the parameter scale. We compare LLMs of different parameter scales; the larger the symbol in the figure, the larger the parameter scale.
  • Figure 5: The recommendation performance of LLMs w.r.t. the context length. We compare LLMs with different context lengths; the larger the symbol in the figure, the longer the context length.
  • ...and 6 more figures