Understanding Biases in ChatGPT-based Recommender Systems: Provider Fairness, Temporal Stability, and Recency

Yashar Deldjoo

TL;DR

This paper investigates biases in ChatGPT-based recommender systems with a focus on item-side fairness. It conducts two large-scale experiments (prompt design analysis in classical top-K recommendations, and sequential in-context learning (ICL)) to evaluate accuracy, provider fairness, catalog coverage, and temporal aspects like recency. Key findings show that system roles embedded in prompts can markedly improve fairness and diversity, while zero-shot RecLLMs generally lag CF baselines in accuracy; however, ICL can offer context-dependent gains, particularly when demographic information is present. The study provides actionable guidance for prompt design and system-role strategies to balance accuracy and item fairness in RecLLMs, with implications for deploying fairer, temporally aware recommendations at scale.

Abstract

This paper explores the biases in ChatGPT-based recommender systems, focusing on provider fairness (item-side fairness). Through extensive experiments and over a thousand API calls, we investigate the impact of prompt design strategies (including structure, system role, and intent) on evaluation metrics such as provider fairness, catalog coverage, temporal stability, and recency. The first experiment examines these strategies in classical top-K recommendations, while the second evaluates sequential in-context learning (ICL). In the first experiment, we assess seven distinct prompt scenarios on top-K recommendation accuracy and fairness. Accuracy-oriented prompts, like Simple and Chain-of-Thought (COT), outperform diversification prompts, which, despite enhancing temporal freshness, reduce accuracy by up to 50%. Embedding fairness into system roles, such as "act as a fair recommender," proved more effective than fairness directives within prompts. Diversification prompts led to recommending newer movies and offered a broader genre distribution than traditional collaborative filtering (CF) models. The second experiment explores sequential ICL, comparing zero-shot and few-shot ICL. Results indicate that including user demographic information in prompts affects model biases and stereotypes. However, ICL did not consistently improve item fairness or catalog coverage over zero-shot learning. Zero-shot learning achieved higher NDCG and coverage, while ICL-2 showed slight improvements in hit rate (HR) when age-group context was included. Our study provides insights into the biases of RecLLMs, particularly in provider fairness and catalog coverage. By examining prompt design, learning strategies, and system roles, we highlight the potential and challenges of integrating LLMs into recommendation systems. Further details can be found at https://github.com/yasdel/Benchmark_RecLLM_Fairness.

Paper Structure (34 sections, 2 equations, 5 figures, 10 tables)

Introduction
Background and Motivation
Contributions.
Related work
Fairness in Recommender Systems
Leveraging Pre-trained LMs and Prompting for Recommender Systems
Evaluation of ChatGPT-based RecLLM
Experiment 1: Examining Prompt Design Strategies in Classical Top-K Recommendations
Goal-oriented prompts
Repeated Experiment for the Stability of the Analysis
Understanding the impact of "System" Role in ChatGPT
Fairness Emphasis
Explicit vs. Implicit Scenario
Experiment 2: Sequential In-Context Learning
Experimental Setup
...and 19 more sections

Figures (5)

Figure 1: Conceptual idea behind experiment 1, prompt-design scenarios.
Figure 2: Sequential in-context learning for various scenarios explored in Experiment 2 of this research.
Figure 3: Studying the stability of GPT-based performance metrics across different runs.
Figure 4: Distribution of movie release years as recommended by different models.
Figure 5: WordCloud of movie genres as recommended by different models. The top models correspond to CF models (BPR-MF, LightGCN, RecVAE), while the lower models are GPT-based recommenders (Simple, Diversity, COT).
