Does LLM Focus on the Right Words? Diagnosing Language Bias in LLM-based Recommenders

Bohao Wang, Jiawei Chen, Feng Liu, Changwang Zhang, Jun Wang, Canghong Jin, Chun Chen, Can Wang

TL;DR

This work identifies a critical language bias emerging when fine-tuning LLM-based recommender systems, where models over-rely on auxiliary tokens at the expense of user interaction signals. It introduces GDRT, a Group Distributionally Robust Optimization-based tuning method that partitions token targets by their relevance to auxiliary prompts and optimizes performance across groups via a closed-form Group DRO objective, thus shifting attention toward informative user signals. Empirical results on three Amazon datasets show that GDRT delivers substantial accuracy gains (average NDCG@10 +24.29%) and fairness improvements (average MGU@5 −37.43%) over state-of-the-art baselines, while remaining efficient and easy to integrate with other LLM-based RS. The findings highlight the practical value of group-aware robust optimization for mitigating shortcut learning in language-grounded recommendations and improving user-centric fairness in large-scale systems.

Abstract

Large language models (LLMs), owing to their extensive open-domain knowledge and semantic reasoning capabilities, have been increasingly integrated into recommender systems (RS). However, a substantial gap remains between the pre-training objectives of LLMs and the specific requirements of recommendation tasks. To address this gap, supervised fine-tuning (SFT) is commonly performed on specially curated recommendation datasets to further enhance their predictive ability. Despite its success, SFT exhibits a critical limitation: it induces Language Bias, whereby the model over-relies on auxiliary tokens (such as task descriptions and prefix-generated tokens) while underutilizing core user interaction tokens that encode user-specific preferences. This bias not only undermines recommendation accuracy but also raises unfairness concerns. To address this issue, we propose Group Distributionally Robust Optimization-based Tuning (GDRT), a novel fine-tuning paradigm that enforces consistent model performance across token groups with varying degrees of relevance to auxiliary tokens. By adaptively upweighting underperforming groups, typically those weakly correlated with auxiliary tokens, GDRT shifts the model's attention from superficial auxiliary cues to informative user interaction tokens, thereby mitigating language bias. Extensive experiments conducted on three public datasets demonstrate that GDRT effectively mitigates language bias, yielding substantial improvements in recommendation accuracy (with an average NDCG@10 gain of 24.29%) and significantly enhancing recommendation fairness.

Paper Structure

This paper contains 26 sections, 1 theorem, 13 equations, 12 figures, 3 tables.

Key Result

Lemma 1

Equation eq:GDRT_loss can be reformulated as the following objective: The parameter $\tau$ is the dual Lagrange coefficient associated with the constraint $D_{KL}(Q, U)\leq\eta$.
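For intuition, the standard dual of a KL-constrained worst-case problem over $G$ groups with uniform reference $U$ is $\max_{Q:\,D_{KL}(Q,U)\leq\eta}\sum_g Q_g L_g = \min_{\tau>0}\ \tau\log\big(\tfrac{1}{G}\sum_g e^{L_g/\tau}\big)+\tau\eta$, and for a fixed $\tau$ the implied group weights are $Q_g\propto e^{L_g/\tau}$. The sketch below evaluates that textbook form for a given $\tau$. It is a minimal illustration and not necessarily the exact objective stated in Lemma 1: the function name `kl_dro_group_loss`, the argument `group_losses`, and the choice to treat $\tau$ as a fixed input are assumptions made here.

```python
import numpy as np


def kl_dro_group_loss(group_losses, tau, eta=0.0):
    """Textbook KL-DRO dual value and group weights for a fixed tau.

    Hypothetical helper for illustration only (not the paper's code).
    group_losses: per-group average losses L_g, one entry per group.
    tau:          positive temperature (dual coefficient of the KL constraint).
    eta:          KL radius; it only adds the constant tau * eta.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    losses = np.asarray(group_losses, dtype=float)
    scaled = losses / tau
    shift = scaled.max()  # subtract the max for numerical stability
    exp_shifted = np.exp(scaled - shift)
    # tau * log(mean_g exp(L_g / tau)) + tau * eta
    objective = tau * (shift + np.log(exp_shifted.mean())) + tau * eta
    # Worst-case weights: softmax(L / tau); higher-loss groups get more weight.
    weights = exp_shifted / exp_shifted.sum()
    return float(objective), weights


if __name__ == "__main__":
    value, w = kl_dro_group_loss([1.0, 2.0, 3.0], tau=1.0)
    print(value, w)
```

With `eta=0`, the returned value lies between the mean and the maximum of the group losses. A small `tau` pushes it toward the worst group, while a large `tau` recovers the plain average, which is the usual way this kind of objective upweights underperforming groups.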

Figures (12)

  • Figure 1: Illustration of LLM-based recommendations and language bias, wherein the model exhibits an over-reliance on auxiliary tokens and insufficient utilization of interaction tokens during generation.
  • Figure 2: Ratio of attribution values between auxiliary tokens and user interaction tokens before and after SFT. Left: task description vs. user interaction tokens. Right: prefix tokens of the predicted item (taking the first token) vs. user interaction tokens.
  • Figure 3: Distribution of Top-1 recommended items generated by SFT-trained LLMs across five groups defined according to the semantic relevance of items to the auxiliary tokens (Group 1: highest relevance, Group 5: lowest relevance). We also present the distribution of the test set across the same five groups for comparison.
  • Figure 4: Proportion of Top-1 recommendations belonging to Item Group 1 (most relevant to auxiliary tokens) over the course of SFT.
  • Figure 5: The co-occurrence rate of different types of token pairs in the training set.
  • ...and 7 more figures

Theorems & Definitions (1)

  • Lemma 1