User-Assistant Bias in LLMs
Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie
TL;DR
This work formalizes user-assistant bias as the tendency of an LLM to favor information tagged with the user role over information tagged with the assistant role (or vice versa) when the two conflict, and introduces the UserAssist benchmark to quantify this bias across 52 frontier models. It shows that instruction-tuned models exhibit strong user bias while base and reasoning models remain near neutral, and that bias is shaped by post-training signals such as human-preference alignment and reasoning-trace distillation. The authors demonstrate bidirectional control of the bias through lightweight direct preference optimization (DPO) on UserAssist-train, and the resulting bias generalizes to more realistic multi-turn debates. These findings show that role tags act as learned control signals and provide a principled framework for diagnosing and controlling tag-induced biases in modern LLMs, with practical implications for controllability and safety in multi-turn interactions.
Abstract
Modern large language models (LLMs) are typically trained and deployed using structured role tags (e.g., system, user, assistant, tool) that explicitly mark the source of each piece of context. While these tags are essential for instruction following and controllability, asymmetries in the training data associated with different role tags can introduce inductive biases. In this paper, we study this phenomenon by formalizing user-assistant bias, defined as the tendency of an LLM to preferentially rely on information from either the user or the assistant role when the two conflict. We introduce a task-agnostic benchmark, UserAssist, and use it to evaluate this bias in 52 frontier models. We observe that most instruction-tuned models exhibit strong user bias, whereas base and reasoning models are close to neutral. Using controlled fine-tuning experiments, we isolate which post-training recipes drive the observed user-assistant bias. We find that human-preference alignment amplifies user bias, while reasoning fine-tuning reduces it. Finally, we show that user-assistant bias can be bidirectionally controlled via direct preference optimization (DPO) on UserAssist-train, and that the resulting bias reliably generalizes to a more realistic multi-turn conversation dataset. These results reveal an underexplored consequence of role-tagged training and provide a principled framework to diagnose and control tag-induced biases in modern LLMs.
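
To make the definition concrete, the following is a minimal sketch of how a role conflict could be posed and scored. It is not the UserAssist implementation or the paper's metric: the `Turn` class, the `build_conflict_dialogue` and `bias_score` helpers, the example claims, and the scoring formula (a normalized difference of answer counts) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str     # role tag, e.g. "user" or "assistant"
    content: str  # text placed under that tag


def build_conflict_dialogue(user_claim, assistant_claim, question):
    """Hypothetical probe: the two roles assert conflicting facts,
    then the user asks a question that forces a choice between them."""
    return [
        Turn("user", user_claim),
        Turn("assistant", assistant_claim),
        Turn("user", question),
    ]


def bias_score(answers, user_value, assistant_value):
    """Assumed score in [-1, 1]: +1 means the model always sides with the
    user-tagged fact, -1 always with the assistant-tagged fact, 0 neutral.
    Answers matching neither value are ignored."""
    n_user = sum(1 for a in answers if a == user_value)
    n_assistant = sum(1 for a in answers if a == assistant_value)
    total = n_user + n_assistant
    if total == 0:
        return 0.0
    return (n_user - n_assistant) / total


# Example: a toy conflict about an arbitrary attribute.
dialogue = build_conflict_dialogue(
    user_claim="The box is red.",
    assistant_claim="The box is blue.",
    question="What color is the box? Answer with one word.",
)

# Answers would come from a model; these are placeholder strings.
sampled_answers = ["red", "red", "blue", "green"]
score = bias_score(sampled_answers, user_value="red", assistant_value="blue")
```

Under this assumed scoring, a positive value corresponds to what the abstract calls user bias and a value near zero to a neutral model.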
