Table of Contents
Fetching ...

User-Assistant Bias in LLMs

Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie

TL;DR

This work formalizes user–assistant bias as the tendency to weight information tagged by the user versus the assistant in LLM contexts and introduces the UserAssist benchmark to quantify this bias across 52 frontier models. It shows that instruction-tuned models exhibit strong user bias while base and reasoning models remain near neutral, and that bias is shaped by post-training signals such as human-preference alignment and reasoning-trace distillation. The authors demonstrate bidirectional control of bias through lightweight direct preference optimization (DPO) on UserAssist-train, with biases generalizing to more realistic multi-turn debates. These findings reveal role tags as learned control signals and provide a principled framework for diagnosing and controlling tag-induced biases in modern LLMs, with practical implications for controllability and safety in multi-turn interactions.

Abstract

Modern large language models (LLMs) are typically trained and deployed using structured role tags (e.g. system, user, assistant, tool) that explicitly mark the source of each piece of context. While these tags are essential for instruction following and controllability, asymmetries in the training data associated with different role tags can introduce inductive biases. In this paper, we study this phenomenon by formalizing user-assistant bias, defined as the tendency of an LLM to preferentially rely on information from either the user or assistant role when there is a conflict. We introduce a task-agnostic benchmark UserAssist and evaluate such bias in 52 frontier models. We observe that most of the instruction-tuned models exhibit strong user bias, whereas base and reasoning models are close to neutral. Using controlled fine-tuning experiments, we isolate which post-training recipes drive the observed user-assistant bias. We find that human-preference alignment amplifies user bias, while reasoning fine-tuning reduces it. Finally, we show that user-assistant bias can be bidirectionally controlled via direct preference optimization (DPO) on UserAssist-train, and that the resulting bias reliably generalizes to a more realistic multi-turn conversation dataset. These results reveal an underexplored consequence of role-tagged training and provide a principled framework to diagnose and control tag-induced biases in modern LLMs.

User-Assistant Bias in LLMs

TL;DR

This work formalizes user–assistant bias as the tendency to weight information tagged by the user versus the assistant in LLM contexts and introduces the UserAssist benchmark to quantify this bias across 52 frontier models. It shows that instruction-tuned models exhibit strong user bias while base and reasoning models remain near neutral, and that bias is shaped by post-training signals such as human-preference alignment and reasoning-trace distillation. The authors demonstrate bidirectional control of bias through lightweight direct preference optimization (DPO) on UserAssist-train, with biases generalizing to more realistic multi-turn debates. These findings reveal role tags as learned control signals and provide a principled framework for diagnosing and controlling tag-induced biases in modern LLMs, with practical implications for controllability and safety in multi-turn interactions.

Abstract

Modern large language models (LLMs) are typically trained and deployed using structured role tags (e.g. system, user, assistant, tool) that explicitly mark the source of each piece of context. While these tags are essential for instruction following and controllability, asymmetries in the training data associated with different role tags can introduce inductive biases. In this paper, we study this phenomenon by formalizing user-assistant bias, defined as the tendency of an LLM to preferentially rely on information from either the user or assistant role when there is a conflict. We introduce a task-agnostic benchmark UserAssist and evaluate such bias in 52 frontier models. We observe that most of the instruction-tuned models exhibit strong user bias, whereas base and reasoning models are close to neutral. Using controlled fine-tuning experiments, we isolate which post-training recipes drive the observed user-assistant bias. We find that human-preference alignment amplifies user bias, while reasoning fine-tuning reduces it. Finally, we show that user-assistant bias can be bidirectionally controlled via direct preference optimization (DPO) on UserAssist-train, and that the resulting bias reliably generalizes to a more realistic multi-turn conversation dataset. These results reveal an underexplored consequence of role-tagged training and provide a principled framework to diagnose and control tag-induced biases in modern LLMs.

Paper Structure

This paper contains 26 sections, 1 equation, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Two UserAssist-test subsets used to measure user-assistant bias. User and assistant alternatively assign attributes to the same set of entities. At the end of the conversation, the model is asked to identify the attribute of the entity. To ensure that position effects do not confound the bias measurement, the dataset balances the turn order: for each case where the user's assignment precedes the assistant's, there is a corresponding case where the assistant's assignment comes first.
  • Figure 2: User-assistant bias in commercial models.
  • Figure 3: User-assistant bias in open-weight models. Because we can access the probability of the generated sequence, the user-assistant bias is evaluated in two ways: difference in target probability (left, log ratio) and generated answer (right, generation). "R1" refers to DeepSeek R1 distilled models.
  • Figure 4: Fine-tuning on different objective has different effect on the user-assistant bias. "Reduce sycophancy" refers to a method proposed in wei2023simple; HH-RLHF and UltraFeedback are datasets for human preference alignment; LIMO and Open Platypus are datasets containing chain-of-thought style reasoning trace.
  • Figure 5: DPO on one UserAssist-train's subset can generalize the bias to the other. Each model can be fine-tuned on each subset on two directions (i.e. towards user bias or assistant bias). Titles above the plots indicates which subset the models are evaluated on. The model labels on the horizontal axis indicate which subset is used for fine-tuning, and which direction the fine-tuning is. Note that we optimize the instruct models, but omit the "instruct" in the label.
  • ...and 6 more figures