Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis
Lanling Xu, Junjie Zhang, Bingqian Li, Jinpeng Wang, Sheng Chen, Wayne Xin Zhao, Ji-Rong Wen
TL;DR
This work systematically studies Large Language Models (LLMs) as recommender systems through a general LLM-RS framework. It formalizes the recommendation input as natural language prompts and analyzes two main factors: the choice of foundation LLM (public availability, tuning strategy, architecture, parameter scale, and context length) and the structure of prompts (task descriptions, user interest modeling, candidate item construction, and prompting strategies). Extensive experiments on MovieLens-1M and Amazon Books examine zero-shot ranking and CTR-style fine-tuning, yielding insights such as the superiority of closed-source LLMs in cold-start scenarios, the gains from instruction tuning and LoRA-based fine-tuning, and the critical role of prompt design in eliciting effective recommendations. The findings offer practical guidance for deploying LLMs in recommendation, including grounding candidate items, leveraging retrieval-augmented strategies for modeling user interest, and balancing efficiency with performance in real-world systems.
Abstract
Recently, Large Language Models (LLMs) such as ChatGPT have showcased remarkable abilities in solving general tasks, demonstrating their potential for applications in recommender systems. To assess how effectively LLMs can be used in recommendation tasks, our study focuses on employing LLMs as recommender systems through prompt engineering. We propose a general framework for utilizing LLMs in recommendation tasks, centered on the capabilities of LLMs as recommenders. To conduct our analysis, we formalize the input of LLMs for recommendation as natural language prompts with two key aspects, and explain how our framework can be generalized to various recommendation scenarios. For the use of LLMs as recommenders, we analyze the impact of public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation results, based on a classification of LLMs. For prompt engineering, we further analyze the impact of four important components of prompts, i.e., task descriptions, user interest modeling, candidate item construction, and prompting strategies. In each section, we first define and categorize concepts in line with the existing literature. We then pose research questions, followed by detailed experiments on two public datasets, to systematically analyze the impact of different factors on performance. Based on our empirical analysis, we finally summarize promising directions to shed light on future research.
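As an illustration of how the four prompt components named in the abstract might be assembled into a single natural language prompt, the following is a minimal sketch and not the authors' implementation. The class name `RecPrompt`, its field names, the section labels, and the example movie titles are all hypothetical placeholders chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecPrompt:
    """Hypothetical container for the four prompt components (names are assumptions)."""
    task_description: str    # what the model is asked to do
    user_history: List[str]  # user interest modeling: previously interacted items
    candidates: List[str]    # candidate item construction
    strategy_hint: str = ""  # prompting strategy, e.g. an output-format cue

    def render(self) -> str:
        """Join the components into one plain-text prompt."""
        history = "\n".join(f"- {title}" for title in self.user_history)
        # Numbering candidates makes it easier to map the answer back to items.
        options = "\n".join(
            f"{i}. {title}" for i, title in enumerate(self.candidates, start=1)
        )
        parts = [
            self.task_description,
            "Interaction history:\n" + history,
            "Candidate items:\n" + options,
        ]
        if self.strategy_hint:
            parts.append(self.strategy_hint)
        return "\n\n".join(parts)


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper's experiments.
    prompt = RecPrompt(
        task_description="Rank the candidate movies for this user.",
        user_history=["Toy Story", "Heat"],
        candidates=["Casino", "Jumanji"],
        strategy_hint="Return only the ranked item numbers.",
    ).render()
    print(prompt)
```

Keeping each component as a separate field makes it straightforward to vary one factor at a time (for example, swapping the candidate list or the strategy hint) while holding the others fixed, which is the kind of controlled comparison the abstract describes.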
