Large Language Models are Biased Because They Are Large Language Models

Philip Resnik

TL;DR

The paper argues that harmful biases in LLMs are an inevitable consequence of their design and of training on human text, formalizing bias via $B = \mathrm{D}(P_f(o|a;X)\|P_f(o;X))^{-1}$. It contends that, because of the distributional hypothesis and the scale of pretraining, LLMs encode the latent structure underlying human-generated text, normative biases included, which makes bias difficult to separate from useful generalization. RLHF and similar mitigations are criticized for letting biases leak back in through human feedback and for failing to address root causes. This prompts a call to rethink foundational assumptions and to pursue modular, knowledge-grounded representations that distinguish conventional meaning from contextually conveyed meaning. The work advocates cross-disciplinary collaboration to redesign AI foundations and emphasizes the societal dimension of bias, arguing that meaning, normativity, and language structure must be treated on par with distribution in model design. Overall, it argues for a shift away from purely distributional optimization toward principled, normative, and knowledge-enabled AI, with attention to governance and accessibility in deploying safer systems.

Abstract

This position paper's primary goal is to provoke thoughtful discussion about the relationship between bias and fundamental properties of large language models. I do this by seeking to convince the reader that harmful biases are an inevitable consequence arising from the design of any large language model as LLMs are currently formulated. To the extent that this is true, it suggests that the problem of harmful bias cannot be properly addressed without a serious reconsideration of AI driven by LLMs, going back to the foundational assumptions underlying their design.

Paper Structure (17 sections)

Introduction
What is bias?
What are large language models models of?
What underlies human-generated text?
Is this an in-principle problem?
What about RLHF?
How might we fix this?
Conclusions: Where to from here?
Post-conclusions discussion
Is harmful LLM bias actually a thing? What's your evidence that existing mitigation methods aren't enough to prevent unmanageable user impact?
How do we actually know that the relevant distinctions are not discovered distributionally by LLMs?
It was already obvious that harmful biases are baked into LLMs
What about people? People are biased, too
What empirical evidence would convince you that LLMs are making relevant distinctions/not encoding harmful biases in their representations?
Mightn't that lead to less useful models?
...and 2 more sections
