Debiasing surgeon: fantastic weights and how to find them
Rémi Nahon, Ivan Luiz De Moura Matos, Van-Tam Nguyen, Enzo Tartaglione
TL;DR
The paper tackles the problem of deep learning biases arising from spurious correlations and proposes Finding Fantastic Weights (FFW) to extract unbiased sub-networks from vanilla-trained models without retraining. By appending a bias extractor and learning a gating mask on the encoder, FFW minimizes the leakage of bias information while preserving task accuracy, using a loss that combines task performance with an empirical mutual information term $\mathcal{I}(b, \hat{b})$. It presents a theoretical framework linking biasedness $\phi$ and task bias $K_{bia}$, and provides unstructured and structured pruning variants that guarantee reduced bias leakage under pruning. Empirical results on Biased MNIST, CelebA, Corrupted CIFAR10, and Multi-Color MNIST show that debiased sub-networks do exist in vanilla models: they achieve competitive task performance at varying sparsity levels and often outperform baselines on debiasing metrics, while also showing that aggressive bias removal is not universally beneficial. The findings suggest a route to energy-efficient debiasing that leverages architectural sparsity rather than heavy retraining, with implications for safety and regulatory compliance in AI systems.
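To make the mutual information term $\mathcal{I}(b, \hat{b})$ more concrete, the sketch below computes a plug-in estimate from paired discrete bias labels `b` and hard predictions `b_hat`, then adds it to a task loss. This is only an illustration under stated assumptions: the function names, the use of hard predictions rather than soft bias-extractor outputs, and the weighting factor `gamma` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def empirical_mutual_information(b, b_hat):
    """Plug-in estimate of I(b, b_hat) in nats from paired discrete labels.

    Assumption: b_hat holds hard (argmax) bias predictions; the paper's
    estimator may instead operate on soft bias-extractor outputs.
    """
    if len(b) != len(b_hat) or len(b) == 0:
        raise ValueError("b and b_hat must be non-empty and of equal length")
    n = len(b)
    joint = Counter(zip(b, b_hat))   # counts of each (b, b_hat) pair
    marg_b = Counter(b)              # counts of each true bias label
    marg_hat = Counter(b_hat)        # counts of each predicted bias label
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        # p_ij / (p_i * p_j), written with counts to limit rounding error
        ratio = count * n / (marg_b[i] * marg_hat[j])
        mi += p_ij * math.log(ratio)
    return mi


def combined_loss(task_loss, b, b_hat, gamma=1.0):
    """Task loss plus a weighted bias-leakage penalty.

    `gamma` is a hypothetical trade-off weight used for illustration only.
    """
    return task_loss + gamma * empirical_mutual_information(b, b_hat)
```

When the bias is perfectly recoverable from balanced binary labels, the estimate equals $\log 2$ nats; when predictions are statistically independent of the bias, it is zero, which is the regime a debiased sub-network would aim for.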
Abstract
Algorithmic biases that can lead to unfair models are an ever-growing concern. Several debiasing approaches have been proposed in deep learning, using more or less sophisticated techniques to discourage models from relying heavily on these biases. However, a question arises: is this extra complexity really necessary? Does a vanilla-trained model already contain "unbiased sub-networks" that can be used in isolation to solve the task without relying on the algorithmic biases? In this work, we show that such a sub-network typically exists and can be extracted from a vanilla-trained model without requiring additional training. We further validate that such a specific architecture is incapable of learning a specific bias, suggesting that architectural countermeasures to the problem of biases in deep neural networks are possible.
