Do Reviews Matter for Recommendations in the Era of Large Language Models?

Chee Heng Tan, Huiying Zheng, Jing Wang, Zhuoyi Lin, Shaodi Feng, Huijing Zhan, Xiaoli Li, J. Senthilnath

TL;DR

The paper investigates whether explicit text reviews remain essential for recommendations in the era of large language models (LLMs) by introducing RAREval, a comprehensive evaluation framework for review-aware systems. It systematically compares traditional deep learning (DL) review-aware models with zero-shot, few-shot, and REVLoRA-fine-tuned LLM approaches across eight public datasets, emphasizing data sparsity and cold-start scenarios. The findings show that LLMs can effectively leverage reviews as part of recommendation engines, often outperforming DL baselines in sparse or cold-start settings, while removing or distorting reviews does not always degrade performance. The work highlights the nuanced role of text reviews, offers practical guidance on efficient LLM fine-tuning for recommendations, and suggests future directions in prompt design and sequential review-aware rating prediction.

Abstract

With the advent of large language models (LLMs), the landscape of recommender systems is undergoing a significant transformation. Traditionally, user reviews have served as a critical source of rich, contextual information for enhancing recommendation quality. However, as LLMs demonstrate an unprecedented ability to understand and generate human-like text, this raises the question of whether explicit user reviews remain essential in the era of LLMs. In this paper, we provide a systematic investigation of the evolving role of text reviews in recommendation by comparing deep learning methods and LLM approaches. Particularly, we conduct extensive experiments on eight public datasets with LLMs and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We further introduce a benchmarking evaluation framework for review-aware recommender systems, RAREval, to comprehensively assess the contribution of textual reviews to the recommendation performance of review-aware recommender systems. Our framework examines various scenarios, including the removal of some or all textual reviews, random distortion, as well as recommendation performance in data sparsity and cold-start user settings. Our findings demonstrate that LLMs are capable of functioning as effective review-aware recommendation engines, generally outperforming traditional deep learning approaches, particularly in scenarios characterized by data sparsity and cold-start conditions. In addition, the removal of some or all textual reviews and random distortion does not necessarily lead to declines in recommendation accuracy. These findings motivate a rethinking of how user preference from text reviews can be more effectively leveraged. All code and supplementary materials are available at: https://github.com/zhytk/RAREval-data-processing.
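
The review ablations named in the abstract (removing some or all reviews, and random distortion) can be sketched as follows. This is a minimal illustration and not the RAREval implementation: the record layout (`user`, `item`, `rating`, `review` keys) and the function names are hypothetical, and word-order shuffling is only an assumed form of distortion, since the abstract does not say how distortion is performed. The `mae` helper is the standard mean absolute error, the metric reported in the paper's figures.

```python
import random


def remove_reviews(records, fraction=1.0, seed=0):
    """Blank the review text of a random `fraction` of records.

    fraction=1.0 removes all reviews; smaller values remove only some.
    Ratings and user/item ids are left untouched.
    """
    rng = random.Random(seed)
    n_drop = round(fraction * len(records))
    drop = set(rng.sample(range(len(records)), n_drop))
    return [
        {**r, "review": ""} if i in drop else dict(r)
        for i, r in enumerate(records)
    ]


def distort_reviews(records, seed=0):
    """Shuffle the word order of every review.

    Assumption: shuffling stands in for the paper's "random distortion",
    whose exact procedure is not given in the abstract.
    """
    rng = random.Random(seed)
    distorted = []
    for r in records:
        words = r["review"].split()
        rng.shuffle(words)
        distorted.append({**r, "review": " ".join(words)})
    return distorted


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Under these assumptions, a model would be evaluated once on the original records and once on each ablated copy, and the resulting MAE values compared to judge how much the review text contributes.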

Paper Structure

This paper contains 25 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: A comprehensive comparison of recommendation models evaluated using our RAREval framework, which reports the average MAE across Amazon datasets under review removal, distortion, and reduction. RAREval further assesses model performance under data sparsity ($k$-core, where each user and item has at least $k$ reviews) and cold-start scenarios (CS$k$, where users have at most $k$ interactions).
  • Figure 2: Illustration of prompt formats and workflows for zero-shot, few-shot, and REVLoRA fine-tuning settings.
  • Figure 3: The RAREval framework evaluates review-aware recommender systems across five distinct settings, each derived from the original dataset.
  • Figure 4: MAE on four Amazon datasets, using the Llama 3B model for the zero-shot and few-shot settings and the Llama 1B model with LoRA for the fine-tuning setting (i.e., REVLoRA).
  • Figure 5: MAE results on four datasets. Dark bars denote the default setting (i.e., with reviews), while light bars indicate performance without reviews.
  • ...and 4 more figures
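
The two data regimes defined in the Figure 1 caption can be made concrete with a short sketch. It assumes interactions are stored as `(user, item, rating)` triples and that the $k$-core filter is applied iteratively until stable, which is the common convention but is not confirmed by the caption; the function names are hypothetical.

```python
from collections import Counter


def k_core(interactions, k):
    """Keep only triples whose user and item both have at least k interactions.

    Dropping one triple can push another user or item below k, so the
    filter repeats until nothing more is removed (assumed convention).
    """
    current = list(interactions)
    while True:
        user_counts = Counter(u for u, _, _ in current)
        item_counts = Counter(i for _, i, _ in current)
        kept = [
            (u, i, r)
            for u, i, r in current
            if user_counts[u] >= k and item_counts[i] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept


def cold_start_users(interactions, k):
    """Return the users with at most k interactions (the CSk setting)."""
    counts = Counter(u for u, _, _ in interactions)
    return {u for u, c in counts.items() if c <= k}
```

A larger `k` in `k_core` yields a denser dataset, while a smaller `k` in `cold_start_users` isolates the users about whom the recommender knows the least.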