
Is Your LLM-as-a-Recommender Agent Trustable? LLMs' Recommendation is Easily Hacked by Biases (Preferences)

Zichen Tang, Zirui Zhang, Qian Wang, Zhenheng Tang, Bo Li, Xiaowen Chu

Abstract

Large Language Models (LLMs) are increasingly deployed in practically valuable agentic workflows such as Deep Research, e-commerce recommendation, and job recruitment. In these applications, LLMs must select the optimal solutions from a massive pool of candidates, which we term the \textit{LLM-as-a-Recommender} paradigm. However, the reliability of LLM agents for recommendation remains underexplored. In this work, we introduce the \textbf{Bias} \textbf{Rec}ommendation \textbf{Bench}mark (\textbf{BiasRecBench}) to highlight the critical vulnerability of such agents to biases in high-value real-world tasks. The benchmark covers three practical domains: paper review, e-commerce, and job recruitment. We construct a \textsc{Bias Synthesis Pipeline with Calibrated Quality Margins} that 1) synthesizes evaluation data by controlling the quality gap between optimal and sub-optimal options, providing a calibrated testbed for eliciting vulnerability to biases; and 2) injects contextual biases that are logical and suited to the option contexts. Extensive experiments on both SOTA (Gemini-{2.5,3}-pro, GPT-4o, DeepSeek-R1) and small-scale LLMs reveal that agents frequently succumb to injected biases despite having sufficient reasoning capability to identify the ground truth. These findings expose a significant reliability bottleneck in current agentic workflows and call for specialized alignment strategies for LLM-as-a-Recommender. The complete code and evaluation datasets will be made publicly available shortly.

Paper Structure

This paper contains 23 sections, 11 equations, 2 figures, 15 tables.

Figures (2)

  • Figure 1: Illustration of Bias Susceptibility in LLM-as-a-Recommender. Counterfeit bias terms injected into sub-optimal options can fool the LLM into overlooking the optimal solution, prioritizing bias over objective quality.
  • Figure 2: Overview of the Data Synthesis Pipeline with Quality Calibration. The pipeline processes raw corpora from the paper review, e-commerce, and recruitment domains through 1) data cleaning, 2) attribute extraction, 3) quality-calibrated construction, and 4) bias injection. Quality Control enforces a quantifiable gap ($\epsilon$) between the Optimal ($o^*$) and Weak ($o^i$) options to ensure ground truth validity. Subsequently, various Context-Relevant (e.g., Authority, Bandwagon) and Context-Irrelevant (e.g., Position, Distraction) biases are injected into the weak options via generative rewriting ($\mathcal{M}_{gen}$) or bias term insertion. Finally, Bias Evaluation assesses whether the LLM Agent remains robust (selecting $o^*$) or succumbs to the injected bias (selecting $o_{inj}$).
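
The last three stages described in the Figure 2 caption (quality control, bias term insertion, and bias evaluation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Option` type, the scalar `quality` score, the reading of the gap as "optimal minus weak is at least $\epsilon$", and the `choose` callback standing in for the LLM agent are all assumptions made for illustration; the paper's exact $\epsilon$-Bound Protocol is not reproduced here.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Option:
    """Hypothetical candidate record; the field names are illustrative only."""
    option_id: str
    text: str
    quality: float  # assumed scalar quality score


def satisfies_margin(optimal: Option, weak: Sequence[Option], epsilon: float) -> bool:
    """Assumed quality-control check: o* beats every weak option by at least epsilon."""
    return all(optimal.quality - w.quality >= epsilon for w in weak)


def inject_bias_term(option: Option, bias_term: str) -> Option:
    """Bias term insertion: append a bias cue to the text, leaving quality unchanged."""
    return replace(option, text=f"{option.text} {bias_term}")


def run_trial(
    optimal: Option,
    weak: Sequence[Option],
    bias_term: str,
    choose: Callable[[Sequence[Option]], str],
    epsilon: float,
) -> str:
    """Inject a bias into the first weak option and classify the agent's pick.

    `choose` is a stand-in for the LLM agent: it receives the candidate list
    and returns the option_id it recommends.
    """
    if not weak:
        raise ValueError("at least one weak option is required")
    if not satisfies_margin(optimal, weak, epsilon):
        raise ValueError("quality gap below epsilon; ground truth is not well separated")

    injected = inject_bias_term(weak[0], bias_term)
    candidates = [injected, *weak[1:], optimal]
    picked = choose(candidates)

    if picked == optimal.option_id:
        return "robust"   # agent selected o*
    if picked == injected.option_id:
        return "biased"   # agent selected the bias-injected weak option
    return "other"        # agent selected an untouched weak option
```

In this sketch, swapping `choose` for a function that prompts a real model and parses its answer would give a per-trial outcome that can be aggregated into a robustness rate. Generative rewriting with $\mathcal{M}_{gen}$ is not covered, since it would require a model call.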

Theorems & Definitions (1)

  • Definition 1: $\epsilon$-Bound Protocol