Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models
Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu
TL;DR
This work addresses the generalization gap in instruction-tuned LLMs caused by dataset biases. It introduces ICD, which automatically identifies biased features using CAL, enforces a zero-information-gain condition ($IG(Y,B)=0$) so biased features offer no predictive information, and rewrites data via causal interventions before standard fine-tuning. Empirical results show ICD improves transfer and challenge generalization across two LLMs and preserves general abilities, with analyses of model confidence and case studies illustrating robustness. The approach offers a principled, automatic debiasing framework that integrates information theory and causal data rewriting to enhance LLM reliability in diverse tasks.
Abstract
Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and rely on them during inference, which hurts their generalizability. Existing remedies are limited: debiasing methods based on prior knowledge cannot cover the diversity of dataset biases, and automatic debiasing methods based on in-context learning do not suppress biases sufficiently. To address these challenges, we combine causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (ICD) framework. To eliminate biases within an instruction-tuning dataset, the biases must provide no additional information for predicting the answers, i.e., the information gain of these biases for predicting the answers must be 0. Guided by this condition, the framework uses a causal intervention-based data rewriting method to automatically balance the distribution of the instruction-tuning dataset and thereby reduce the information gain. It then applies standard supervised fine-tuning to train LLMs on the debiased dataset. Experimental results show that ICD effectively debiases LLMs and improves their generalizability across different tasks.
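
To make the zero-information-gain condition concrete, the following is a minimal sketch that assumes the textbook definition $IG(Y,B) = H(Y) - H(Y \mid B)$ over discrete answers $Y$ and a discrete biased feature $B$; the exact estimator used by ICD may differ. The function names and the toy data (a hypothetical "contains a negation word" feature on an NLI-style task) are illustrative assumptions, and the balanced set is constructed by hand rather than produced by the paper's rewriting method.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(answers, bias_feature):
    """IG(Y, B) = H(Y) - H(Y | B) for paired discrete observations."""
    total = len(answers)
    groups = {}
    for y, b in zip(answers, bias_feature):
        groups.setdefault(b, []).append(y)
    # H(Y | B): entropy of the answers within each bias group, weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(answers) - conditional


# Hypothetical biased data: negation (B=1) co-occurs mostly with "contradiction".
biased_y = ["contradiction"] * 3 + ["entailment"] + ["contradiction"] + ["entailment"] * 3
biased_b = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical balanced data: each answer is equally likely with or without negation.
balanced_y = ["contradiction", "contradiction", "entailment", "entailment"] * 2
balanced_b = [1, 1, 1, 1, 0, 0, 0, 0]

ig_biased = information_gain(biased_y, biased_b)
ig_balanced = information_gain(balanced_y, balanced_b)
print(f"IG on biased data:   {ig_biased:.4f}")
print(f"IG on balanced data: {ig_balanced:.4f}")
```

On the biased toy set the feature carries a positive amount of information about the answer, while on the balanced set the information gain vanishes. This is the property that the data rewriting step aims to reach before fine-tuning.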
