Collapsed Language Models Promote Fairness

Jingxuan Xu; Wuyang Chen; Linyi Li; Yao Zhao; Yunchao Wei

TL;DR

The paper investigates how neural collapse manifests in language models and its relation to fairness, revealing that debiased LMs tend to exhibit stronger NC signals, especially NC$_3$, for fairness-sensitive words. Building on this insight, the authors introduce a simple regularization term that explicitly enforces NC$_3$ during fine-tuning, achieving consistent fairness improvements across intrinsic and extrinsic metrics without sacrificing performance on standard NLU tasks. They demonstrate the method's plug-and-play nature across multiple debiasing baselines (BEC, Mabel, ASE) and provide extensive ablations, calibrations of NC metrics, and visualizations to support the approach. The work contributes a principled, generalizable pathway to enhance fairness in LMs and highlights the practical value of neural-collapse-driven regularization in debiasing efforts.
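To make the idea of an NC$_3$ regularizer concrete, here is a minimal plain-Python sketch. It assumes NC$_3$ is enforced as a self-duality penalty, 1 - cos(w_c, μ_c - μ_G), averaged over the fairness-sensitive classes in a batch and added to the fine-tuning loss with a weight lam. The function names (nc3_penalty, regularized_loss), the default weight, and the exact form of the penalty are illustrative assumptions, not the authors' formulation; a real fine-tuning run would compute the same quantity on tensors with automatic differentiation.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; eps guards against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def nc3_penalty(features, labels, classifier):
    """Hypothetical NC3-style penalty: mean over classes of 1 - cos(w_c, mu_c - mu_G).

    features:   last-layer representations h_i, one list of floats per sample
    labels:     class id of each sample (e.g. the target fairness-sensitive token)
    classifier: maps class id -> classifier row w_c (e.g. its output word embedding)
    """
    dim = len(features[0])
    # Global mean mu_G over every sample in the batch.
    mu_g = [sum(h[d] for h in features) / len(features) for d in range(dim)]
    # Group samples by class to form the class means mu_c.
    groups = defaultdict(list)
    for h, y in zip(features, labels):
        groups[y].append(h)
    terms = []
    for y, hs in groups.items():
        mu_c = [sum(h[d] for h in hs) / len(hs) for d in range(dim)]
        centered = [m - g for m, g in zip(mu_c, mu_g)]
        # 0 when w_c points along the centered class mean, up to 2 when it points opposite.
        terms.append(1.0 - cosine(classifier[y], centered))
    return sum(terms) / len(terms)


def regularized_loss(task_loss, features, labels, classifier, lam=0.1):
    """Original fine-tuning (or debiasing) loss plus lam times the NC3-style penalty."""
    return task_loss + lam * nc3_penalty(features, labels, classifier)


# Toy batch: two classes, two 2-D samples each.
feats = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]]
labels = [0, 0, 1, 1]
aligned = {0: [0.95, 0.05], 1: [-0.95, -0.05]}
misaligned = {0: [0.0, 1.0], 1: [0.0, -1.0]}
print(nc3_penalty(feats, labels, aligned))     # close to 0
print(nc3_penalty(feats, labels, misaligned))  # about 0.947
```

Because the penalty is added to whatever loss a debiasing method already uses, the same term can be attached to different baselines without changing their objectives, which is the plug-and-play property the summary describes.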

Abstract

To mitigate societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches has been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized fine-tuning, and more. Despite this progress, it remains nontrivial to reach a principled understanding of fairness and an effective algorithm that can consistently debias language models. In this work, through rigorous evaluations of Neural Collapse (a learning phenomenon that occurs in the last-layer representations and classifiers of deep networks) on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that can effectively improve fairness across a wide range of debiasing methods, while still preserving the performance of language models on standard natural language understanding tasks. Our code is available at https://github.com/Xujxyang/Fairness-NC-main.
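The alignment mentioned in the abstract can be quantified with the classic NC$_3$ self-duality distance from the neural-collapse literature, ||W/||W||_F - M/||M||_F||_F, where the rows of W are classifier vectors (for a masked-LM head, the output word embeddings of the target tokens) and the rows of M are the corresponding centered class means of last-layer token representations. The sketch below assumes this definition; the paper's own NC metrics may be normalized or calibrated differently, and the function names are hypothetical. Smaller values indicate more collapsed alignment.

```python
import math


def frobenius(mat):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in mat for x in row))


def nc3_distance(W, M):
    """NC3 self-duality distance || W/||W||_F - M/||M||_F ||_F.

    W: classifier rows w_c (e.g. output word embeddings), shape C x d
    M: centered class means mu_c - mu_G of token representations, same class order
    Returns 0 when W is a positive scalar multiple of M (full collapse).
    """
    nw, nm = frobenius(W), frobenius(M)
    diff = [[w / nw - m / nm for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(W, M)]
    return frobenius(diff)


# Two toy classes in 2-D: scaled copies align perfectly; swapped rows do not.
print(nc3_distance([[2.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]))  # 0.0
print(nc3_distance([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]))  # about 1.414
```

Normalizing both matrices by their Frobenius norms makes the score insensitive to the overall scale of embeddings and representations, so only their relative geometry is compared.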

Paper Structure (33 sections, 7 equations, 2 figures, 16 tables)

This paper contains 33 sections, 7 equations, 2 figures, 16 tables.

Introduction
Related Works
Neural Collapse
Fairness in Language Models
Representative Debiased Language Models.
Debiased Language Models are More Collapsed
Preliminary: Why Do Language Models Collapse?
Neural Collapse in Language Models.
Debiased LMs are More Collapsed Than Biased LMs
$\mathcal{NC}$ Metrics in Debiased LMs
Settings.
Calibrations of $\mathcal{NC}_1$ and $\mathcal{NC}_2$
Debiased LMs Are More Collapsed in Fairness-Sensitive Words
Bias Mitigation via Enforcing Explicit Collapse in LMs
Enforcing $\mathcal{NC}$ Promotes Fairness, Both Intrinsically and Extrinsically
...and 18 more sections

Figures (2)

Figure 1: Debiased LMs show more collapsed alignment between classifiers ($\bm{w}$) and class means ($\bm{\mu}$). See $(\mathcal{U})\mathcal{NC}_3$ in the $\mathcal{NC}_3$ table (tab:nc3).
Figure 2: t-SNE plots of the logits of the two models in the table labeled tab:Vanilla. We collect 15 samples each of "Herself," "Himself," "Mother," and "Father."
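Figure 1 reports a $(\mathcal{U})\mathcal{NC}_3$ variant. Assuming the (U) prefix denotes a uniformity-of-duality measure, one plausible way to compute it is the coefficient of variation (standard deviation divided by mean) of the per-class cosine similarities cos(w_c, μ_c - μ_G): lower values mean every class is aligned with its classifier to a similar degree. This definition and the function name uniform_duality_cov are assumptions for illustration; the paper's $\mathcal{NC}_3$ table gives the exact metric.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def uniform_duality_cov(W, M):
    """Hypothetical (U)NC3-style score: coefficient of variation of
    cos(w_c, mu_c - mu_G) across classes (population std / mean).

    W: classifier rows w_c; M: centered class means mu_c - mu_G, same order.
    Lower means more uniform alignment; assumes the mean similarity is non-zero.
    """
    sims = [cosine(w, m) for w, m in zip(W, M)]
    return statistics.pstdev(sims) / statistics.fmean(sims)


# Every class equally aligned -> 0.0; one class less aligned -> positive score.
print(uniform_duality_cov([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]]))  # 0.0
print(uniform_duality_cov([[1.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]))  # about 0.172
```

Dividing by the mean makes the score relative: it separates "all classes aligned equally well" from "a few classes aligned much better than the rest", which a plain average of cosine similarities cannot distinguish.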
