Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs
Wei Xia
TL;DR
This work addresses context-induced bias in aligned LLMs and shows that intervening on the final logits is more stable than intervening on hidden layers. The authors introduce two zero-shot decoding strategies, Static Contextual Contrast Decoding (CCD) and Dynamic Semantic-Aware (DSA), that adjust the final logits without retraining and preserve fluency through constrained generation. Logit-Lens analysis shows that bias solidifies in the middle-to-late layers, which motivates intervening at the final logits. Both methods debias strongly across benchmarks; Dynamic yields the largest gains (up to roughly 70% bias reduction) and holds up across datasets, especially for multilingual models. The methods are plug-and-play and apply broadly to aligned LLMs, at the cost of some added latency and the need to handle context removal.
Abstract
We propose Static and Dynamic, two zero-shot debiasing methods that intervene at the logits layer. Dynamic reduces bias by up to 70% with minimal fluency loss. Logit intervention outperforms hidden-layer approaches, and we show that semantic-aware logit intervention is stable and effective for debiasing aligned LLMs.
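As a rough illustration of what a logits-layer contrast can look like, the sketch below shifts context-conditioned logits back toward context-free logits over a restricted candidate set. It is not the paper's formulation: the function names, the `alpha` strength parameter, the toy logit values, and the specific update rule are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Convert a {token: logit} mapping into probabilities."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def contrast_logits(with_context, without_context, alpha=0.5, candidates=None):
    """Hypothetical static contrast (an assumption, not the authors' rule).

    Subtracts alpha times the context-induced shift from each logit.
    alpha=0 keeps the context-conditioned logits; alpha=1 recovers the
    context-free logits. `candidates` restricts the output to a fixed
    answer set, standing in for constrained generation.
    """
    tokens = list(candidates) if candidates is not None else list(with_context)
    return {
        t: with_context[t] - alpha * (with_context[t] - without_context[t])
        for t in tokens
    }

# Toy logits for three answer options (made-up numbers).
with_ctx = {"A": 2.0, "B": 0.5, "unknown": 0.0}
without_ctx = {"A": 0.5, "B": 0.5, "unknown": 1.0}

adjusted = contrast_logits(with_ctx, without_ctx, alpha=0.5)
p_before = softmax(with_ctx)
p_after = softmax(adjusted)
```

In this toy case the context pushes probability toward option "A", and the adjustment pulls part of that shift back. A dynamic variant would presumably choose the strength per input rather than fixing `alpha`, but the abstract does not specify how.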
