Large Language Models as Recommender Systems: A Study of Popularity Bias
Jan Malte Lichtenberg, Alexander Buchholz, Pola Schwöbel
TL;DR
The paper studies popularity bias in recommender systems built on general-purpose LLMs. It proposes a principled framework for measuring the bias and introduces a new metric, the log popularity difference. On MovieLens 10M, simple LLM-based recommenders can exhibit less popularity bias than traditional baselines, though their accuracy may lag behind collaborative filtering. Prompt-based debiasing can reduce the bias further, but often at the cost of predictive performance, so deploying LLM-based recommenders calls for careful prompting. The work also highlights how much the choice of metric matters when measuring and mitigating popularity bias, and with it the balance between relevance and exposure for popular and niche items.
Abstract
The issue of popularity bias -- where popular items are disproportionately recommended, overshadowing less popular but potentially relevant items -- remains a significant challenge in recommender systems. Recent advancements have seen the integration of general-purpose Large Language Models (LLMs) into the architecture of such systems. This integration raises concerns that it might exacerbate popularity bias, given that the LLM's training data is likely dominated by popular items. However, it simultaneously presents a novel opportunity to address the bias via prompt tuning. Our study explores this dichotomy, examining whether LLMs contribute to or can alleviate popularity bias in recommender systems. We introduce a principled way to measure popularity bias by discussing existing metrics and proposing a novel metric that fulfills a series of desiderata. Based on our new metric, we compare a simple LLM-based recommender to traditional recommender systems on a movie recommendation task. We find that the LLM recommender exhibits less popularity bias, even without any explicit mitigation.
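The summary above names the log popularity difference metric but does not define it. As a rough illustration only, the sketch below assumes one plausible reading of the name: for each user, the mean log popularity of the recommended items minus the mean log popularity of the items in that user's profile, averaged over users. The function names, the interaction-count definition of popularity, and the toy data are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def item_popularity(interactions):
    """Count interactions per item; `interactions` is a list of (user, item) pairs."""
    return Counter(item for _, item in interactions)


def mean_log_popularity(items, popularity):
    """Average natural-log popularity of a non-empty item list.

    Assumes every item has at least one interaction, so log() is defined.
    """
    return sum(math.log(popularity[i]) for i in items) / len(items)


def log_popularity_difference(profiles, recommendations, popularity):
    """Assumed definition: per-user (recommended minus profile) mean log
    popularity, averaged over all users in `profiles`."""
    diffs = [
        mean_log_popularity(recommendations[u], popularity)
        - mean_log_popularity(profiles[u], popularity)
        for u in profiles
    ]
    return sum(diffs) / len(diffs)


# Hypothetical toy data: item "a" is popular, item "c" is niche.
toy_interactions = (
    [("u%d" % k, "a") for k in range(8)]
    + [("u%d" % k, "b") for k in range(4)]
    + [("u0", "c")]
)
toy_popularity = item_popularity(toy_interactions)
toy_profiles = {"u0": ["b", "c"], "u1": ["a", "b"]}
toy_recs = {"u0": ["a", "a"], "u1": ["a", "b"]}

print(log_popularity_difference(toy_profiles, toy_recs, toy_popularity))
```

Under this reading, a positive value means the recommender pushes users toward items more popular than those they already interacted with, a value near zero means recommendations match the popularity level of the profile, and a negative value means recommendations skew toward niche items.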
