Large Language Models as Recommender Systems: A Study of Popularity Bias

Jan Malte Lichtenberg, Alexander Buchholz, Pola Schwöbel

TL;DR

The paper studies popularity bias in recommender systems built on general-purpose LLMs. It reviews existing bias metrics against a set of desiderata and introduces a new one, the log popularity difference. On MovieLens 10M, a simple LLM-based recommender exhibits less popularity bias than traditional baselines, though its accuracy may lag behind collaborative filtering. Prompt-based debiasing can reduce the bias further, but often at a cost in predictive performance. The work also highlights that the choice of metric matters when measuring and mitigating popularity bias in LLM-based recommenders.

Abstract

The issue of popularity bias -- where popular items are disproportionately recommended, overshadowing less popular but potentially relevant items -- remains a significant challenge in recommender systems. Recent advancements have seen the integration of general-purpose Large Language Models (LLMs) into the architecture of such systems. This integration raises concerns that it might exacerbate popularity bias, given that the LLM's training data is likely dominated by popular items. However, it simultaneously presents a novel opportunity to address the bias via prompt tuning. Our study explores this dichotomy, examining whether LLMs contribute to or can alleviate popularity bias in recommender systems. We introduce a principled way to measure popularity bias by discussing existing metrics and proposing a novel metric that fulfills a series of desiderata. Based on our new metric, we compare a simple LLM-based recommender to traditional recommender systems on a movie recommendation task. We find that the LLM recommender exhibits less popularity bias, even without any explicit mitigation.

Paper Structure (25 sections, 8 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Popularity scores of the MovieLens dataset. Left: raw scores, i.e., counts of how often each movie has been rated. We run a goodness-of-fit test (using the distfit package, https://erdogant.github.io/distfit/pages/html/index.html) for several heavy-tailed distributions and find that a Pareto distribution, i.e., a power law, is the best fit. The estimated coefficient is $\alpha = 0.68$, making both mean and variance undefined. Right: log-transformed popularity scores.
  • Figure 2: Results on a subsample of the MovieLens 10M dataset. We repeat the experiments $5$ times, with $1000$ users in each fold. Reported are the mean plus/minus one standard error of the mean.
  • Figure 3: Results for the mitigation experiment. The results are grouped by base LLM in the WOK model. Bold numbers indicate best performance across all models.
  • Figure 4: Kendall’s Tau correlation coefficients between the various metrics, measured across the experiments reported in Figures 2 and 3.
  • Figure 5: Prompt template used for the LLM movie recommender. The placeholder watch_history is replaced by a list of movies watched by the user at runtime.
  • ...and 1 more figure
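
The Figure 1 caption reports a Pareto fit with $\alpha = 0.68$ and shows log-transformed popularity scores alongside the raw counts. The sketch below illustrates why the log transform may help. It rests on several assumptions: the data are synthetic Pareto samples rather than MovieLens counts, the shape is estimated with a hand-written maximum-likelihood formula rather than the distfit package named in the caption, and the helper names (`sample_pareto`, `fit_pareto_alpha`) are hypothetical. For a Pareto shape below 1 the raw mean does not converge, whereas the log of a Pareto variable with minimum 1 is exponentially distributed with finite mean $1/\alpha$.

```python
import math
import random


def sample_pareto(alpha, n, x_min=1.0, seed=0):
    """Draw n Pareto(alpha, x_min) samples by inverse-CDF sampling (synthetic data)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the power below is always finite and >= 1.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


def fit_pareto_alpha(samples, x_min=1.0):
    """Maximum-likelihood estimate of the Pareto shape: n / sum(log(x / x_min))."""
    return len(samples) / sum(math.log(x / x_min) for x in samples)


def mean(values):
    return sum(values) / len(values)


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])


alpha = 0.68  # shape reported in the Figure 1 caption
scores = sample_pareto(alpha, n=20_000)   # stand-in for raw popularity counts
log_scores = [math.log(s) for s in scores]
alpha_hat = fit_pareto_alpha(scores)

print("estimated alpha:", round(alpha_hat, 3))
print("raw mean vs. median:", mean(scores), median(scores))
print("log-score mean vs. 1/alpha:", mean(log_scores), 1.0 / alpha)
```

On the raw scale the sample mean is driven by a handful of extreme values and sits far above the median. On the log scale the mean settles near $1/\alpha$, which is one reason a metric built on log popularity can be better behaved than one built on raw counts.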