Expert-Guided Extinction of Toxic Tokens for Debiased Generation

Xueyao Sun, Kaize Shi, Haoran Tang, Guandong Xu, Qing Li

TL;DR

This work tackles social bias and toxicity in large language model generation by introducing EXPOSED, a decoding-time debiasing framework built around a debiasing expert that is trained on toxic content to reveal biased token candidates. EXPOSED combines two stages: continued pre-training of the debiasing expert, and a distributional reconstruction step that modifies the LLM output distribution through a decay-based reweighting of token probabilities, guided by the expert's bias signal $\Delta$. Across open-ended generation, BBQ reading comprehension, and Winogender cloze tests, EXPOSED substantially reduces toxicity and stereotypes while preserving fluency and keeping inference latency practical. It is model-agnostic, working with the GPT-Neo, FLAN-T5, and LLaMA2 families. The results indicate a favorable generation-fairness trade-off, and exposing potentially harmful tokens during decoding adds an interpretability benefit. The work also discusses limitations, including the dependence on biased corpora for training the debiasing expert and the sensitivity of the decay function, and suggests improvements in decay design and corpus selection to increase robustness and coverage.

Abstract

Large language models (LLMs) can exhibit social bias during generation, especially when prompted with toxic inputs. Controlling sensitive attributes in generation raises challenges in data distribution, generalizability, and efficiency. Specifically, fine-tuning and retrieval demand an extensive unbiased corpus, while direct prompting requires meticulously curated instructions that correct the output over multiple rounds of thought, which strains memory and inference latency. In this work, we propose the Expert-Guided Extinction of Toxic Tokens for Debiased Generation (EXPOSED) to eliminate undesired harmful outputs from LLMs without the aforementioned requirements. EXPOSED constructs a debiasing expert from the abundant toxic corpus to expose and elicit potentially dangerous tokens. It then processes the output of the LLM and constructs a fair distribution by suppressing and attenuating the toxic tokens. EXPOSED is evaluated on fairness benchmarks over three LLM families. Extensive experiments demonstrate that, compared with other baselines, EXPOSED significantly reduces potential social bias while balancing fairness and generation performance.
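
The suppress-and-attenuate idea can be illustrated with a small sketch. It is not the paper's implementation: this summary does not give the definition of the bias signal or the form of the decay function. The sketch assumes the bias signal is the gap between the expert's and the base model's probability for each token, and that the decay is exponential. The names `reconstruct`, `threshold`, and `decay_strength` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct(base_probs, expert_probs, threshold=0.0, decay_strength=5.0):
    """Attenuate tokens that the toxic expert favors more than the base model.

    ASSUMPTION: the bias signal is (expert probability - base probability),
    and the decay is exponential in that signal. Both choices are
    illustrative and are not taken from the paper.
    """
    reweighted = []
    for p_base, p_expert in zip(base_probs, expert_probs):
        delta = p_expert - p_base  # hypothetical per-token bias signal
        if delta > threshold:
            # Token looks toxic: shrink its probability mass.
            weight = math.exp(-decay_strength * (delta - threshold))
        else:
            # Token looks safe: leave it unchanged.
            weight = 1.0
        reweighted.append(p_base * weight)
    total = sum(reweighted)
    return [w / total for w in reweighted]  # renormalize to sum to 1


# Toy three-token vocabulary; the logits are made up for illustration.
vocab = ["kind", "person", "idiot"]
base_probs = softmax([1.0, 1.2, 1.1])     # off-the-shelf LLM
expert_probs = softmax([-1.0, 0.0, 3.0])  # expert trained on toxic text
fair_probs = reconstruct(base_probs, expert_probs)

for token, p_before, p_after in zip(vocab, base_probs, fair_probs):
    print(f"{token:>7}: {p_before:.3f} -> {p_after:.3f}")
```

In this toy setup only the token the expert strongly prefers loses probability mass. The remaining tokens keep their relative order because the distribution is renormalized after reweighting.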

Paper Structure (37 sections, 5 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: The motivation of our work. Large language models may elicit social bias during generation, especially when encountering potentially toxic input. However, existing debiasing methods for generative language models encounter several difficulties.
  • Figure 2: EXPOSED contains two stages: continued pre-training and distributional reconstruction. The continued pre-training stage uses a toxic corpus to train the debiasing expert. In the distributional reconstruction stage, the expert decodes jointly with the off-the-shelf language model and reconstructs its output.
  • Figure 3: The toxicity value of open-ended generation over various $\lambda$ and $\tau$.
  • Figure 4: Performance of open-ended generation on RealToxicityPrompts for the analyses in hyperparameter selection, inference latency, and generation-fairness trade-off.