Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs

Wei Xia

TL;DR

This work addresses context-induced bias in aligned LLMs and argues that intervening on the final logits is more stable than intervening on hidden layers. The authors introduce two zero-shot decoding strategies, Static Contextual Contrast Decoding (CCD) and Dynamic Semantic-Aware (DSA), that adjust the final logits without retraining and preserve fluency through constrained generation. Logit-Lens analysis shows that bias solidifies in the middle-to-late layers, which motivates intervening at the final logits. Both methods reduce bias across benchmarks; Dynamic yields the largest gains (up to ~70% reduction) and generalizes robustly across datasets, especially on multilingual models. The methods are plug-and-play and broadly applicable to aligned LLMs, at the cost of some added latency and considerations around context removal.

Abstract

We propose Static and Dynamic, two zero-shot debiasing methods that operate at the logits layer. Dynamic reduces bias by up to 70% with minimal fluency loss. Logits intervention outperforms hidden-layer approaches. We show that semantic-aware logits intervention is stable and effective for debiasing aligned LLMs.
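To make the idea of a final-logits intervention concrete, the sketch below contrasts logits computed with and without the biasing context and then picks a token from a restricted candidate set. It is a minimal illustration under stated assumptions, not the paper's formulation: the update rule `w - gamma * (w - wo)`, the function names `contrast_logits` and `constrained_choice`, the toy logits, and the value of `gamma` are all hypothetical, and the scale of `gamma` here need not match the $\gamma$ reported in the figures.

```python
def contrast_logits(with_context, without_context, gamma):
    """Subtract a scaled context-induced offset from the final logits.

    ASSUMPTION: this linear contrast is one plausible form of a static
    contextual contrast; the source does not give the exact equation.
    """
    return [w - gamma * (w - wo) for w, wo in zip(with_context, without_context)]


def constrained_choice(logits, allowed):
    """Greedy pick restricted to an allowed candidate set (constrained generation)."""
    return max(allowed, key=lambda i: logits[i])


# Toy final-layer logits for three candidates (hypothetical values):
# index 0 = stereotypical, 1 = anti-stereotypical, 2 = unrelated.
with_ctx = [2.0, 0.5, 0.1]      # prompt including the biasing context
without_ctx = [1.0, 0.9, 0.1]   # same prompt with the context removed

adjusted = contrast_logits(with_ctx, without_ctx, gamma=1.5)

before = constrained_choice(with_ctx, allowed=[0, 1])
after = constrained_choice(adjusted, allowed=[0, 1])
print(before, after)
```

In this toy case the unadjusted logits favor the stereotypical candidate and the adjusted logits favor the other one. Restricting the choice to `allowed` is what keeps the output within valid candidates even when the adjustment is large.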

Paper Structure

This paper contains 8 sections, 8 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Bias injection trajectories in Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. The critical layer $l^*$ (peak JSD) occurs earlier in Qwen (layers 12--15) than in Llama (layers 15--20).
  • Figure 2: Dose-response curves on StereoSet. The Dynamic method consistently outperforms Static in reducing Stereotype Score while preserving fluency (invalid rate $<2\%$ up to $\gamma=20$). Qwen exhibits steeper gains, reflecting greater context sensitivity in multilingual models.
  • Figure 3: Stereotype reduction at $\gamma=5.0$ across datasets. Dynamic achieves up to a 70% bias drop on Qwen with an invalid rate $<0.7\%$, generalizing more robustly than Static and the baselines.
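
Figure 1 defines the critical layer $l^*$ as the layer with peak JSD (Jensen-Shannon divergence). The sketch below shows how such a layer could be located, assuming the divergence is measured between per-layer next-token distributions obtained with and without the biasing context; this section does not state which distributions are compared, so that pairing, the helper names `jsd` and `critical_layer`, and the toy distributions are all hypothetical.

```python
import math


def kl(p, q):
    """KL divergence in nats, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def critical_layer(with_context, without_context):
    """Return (index of the layer with peak JSD, per-layer JSD scores)."""
    scores = [jsd(p, q) for p, q in zip(with_context, without_context)]
    peak = max(range(len(scores)), key=lambda i: scores[i])
    return peak, scores


# Toy per-layer distributions over two candidate tokens (hypothetical values).
with_ctx = [[0.5, 0.5], [0.6, 0.4], [0.9, 0.1], [0.8, 0.2]]
without_ctx = [[0.5, 0.5]] * 4

layer, scores = critical_layer(with_ctx, without_ctx)
print(layer, [round(s, 4) for s in scores])
```

In practice the per-layer distributions would come from a Logit-Lens style projection of each hidden state through the model's output head, which is not shown here.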