Expert-Guided Extinction of Toxic Tokens for Debiased Generation
Xueyao Sun, Kaize Shi, Haoran Tang, Guandong Xu, Qing Li
TL;DR
This work tackles social bias and toxicity in large language model generation by introducing EXPOSED, a decoding-time debiasing framework built around a debiasing expert that is trained on toxic content to reveal biased token candidates. EXPOSED combines a debiasing expert obtained by continued pretraining with a distributional reconstruction step, which modifies the LLM output distribution through a decay-based reweighting of token probabilities based on the expert's bias signal Delta. Across open-ended generation, BBQ reading comprehension, and Winogender cloze tests, EXPOSED substantially reduces toxicity and stereotypes while preserving fluency and keeping inference latency practical. It is model-agnostic, working with the GPT-Neo, FLAN-T5, and LLaMA2 families. The results indicate a favorable bias-generation trade-off, with an added interpretability benefit: potentially harmful tokens are exposed during decoding. The work also discusses limitations, including the dependence on biased corpora for training the debiasing expert and the sensitivity of the decay function, and suggests that future work on decay design and corpus selection could further improve robustness and coverage.
Abstract
Large language models (LLMs) can exhibit social bias during generation, especially when prompted with toxic inputs. Controlling sensitive attributes in generation faces challenges in data distribution, generalizability, and efficiency. Specifically, fine-tuning and retrieval demand an extensive unbiased corpus, while direct prompting requires meticulously curated instructions to correct the output over multiple rounds of thought, which strains memory and inference latency. In this work, we propose Expert-Guided Extinction of Toxic Tokens for Debiased Generation (EXPOSED) to eliminate undesired harmful outputs from LLMs without the aforementioned requirements. EXPOSED constructs a debiasing expert from abundant toxic corpora to expose and elicit potentially dangerous tokens. It then processes the output of the LLMs and constructs a fair distribution by suppressing and attenuating the toxic tokens. EXPOSED is evaluated on fairness benchmarks over three LLM families. Extensive experiments demonstrate that, compared with the baselines, EXPOSED significantly reduces potential social bias while balancing fairness and generation performance.
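As a rough illustration of suppressing and attenuating tokens that a toxic expert exposes, the sketch below reweights a toy next-token distribution. It is not the paper's formulation: the exponential decay, the `alpha` strength parameter, the `reweight` helper, and the three-word vocabulary are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reweight(llm_logits, expert_logits, alpha=5.0):
    """Attenuate tokens the toxic expert favors, then renormalize.

    Hypothetical decay: each LLM probability is multiplied by
    exp(-alpha * expert_probability). The paper's actual decay
    function may differ.
    """
    p_llm = softmax(llm_logits)
    p_expert = softmax(expert_logits)
    decayed = [p * math.exp(-alpha * q) for p, q in zip(p_llm, p_expert)]
    total = sum(decayed)
    return [d / total for d in decayed]

# Toy vocabulary and logits, invented for illustration only.
vocab = ["kind", "smart", "stupid"]
llm_logits = [1.0, 1.0, 1.5]      # base LLM slightly prefers the toxic token
expert_logits = [0.0, 0.0, 3.0]   # toxic expert strongly prefers it

original = softmax(llm_logits)
debiased = reweight(llm_logits, expert_logits)
for token, before, after in zip(vocab, original, debiased):
    print(f"{token:>6}: {before:.3f} -> {after:.3f}")
```

In this toy setting the token favored by the expert loses most of its probability mass, and the remaining tokens keep their relative order. Setting `alpha=0` recovers the unmodified LLM distribution, which gives a simple way to see how sensitive the result is to the decay strength.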
