Large Language Models are Biased Because They Are Large Language Models
Philip Resnik
TL;DR
The paper argues that harmful biases in LLMs are an inevitable consequence of their design and of training on human text, formalizing bias via $B = \mathrm{D}(P_f(o|a;X)\|P_f(o;X))^{-1}$. It contends that, because of the distributional hypothesis and the scale of pretraining, LLMs encode latent human structure, including normative biases, which makes bias difficult to separate from useful generalization. RLHF and similar mitigations are criticized for leaking biases through human feedback and for failing to address root causes. This prompts a call to rethink foundational assumptions and pursue modular, knowledge-grounded representations that distinguish conventional meaning from contextually conveyed meaning. The work advocates cross-disciplinary collaboration to redesign AI foundations and emphasizes the societal dimension of bias, arguing that meaning, normativity, and language structure must be treated on par with distributional patterns in model design. Overall, it argues for a shift away from purely distributional optimization toward principled, normative, and knowledge-enabled AI, with attention to governance and accessibility in deploying safer systems.
Abstract
This position paper's primary goal is to provoke thoughtful discussion about the relationship between bias and fundamental properties of large language models. I do this by seeking to convince the reader that harmful biases are an inevitable consequence arising from the design of any large language model as LLMs are currently formulated. To the extent that this is true, it suggests that the problem of harmful bias cannot be properly addressed without a serious reconsideration of AI driven by LLMs, going back to the foundational assumptions underlying their design.
