Reducing Bias in Pre-trained Models by Tuning while Penalizing Change
Niklas Penzel, Gideon Stein, Joachim Denzler
TL;DR
This work tackles post-hoc debiasing of pre-trained image classifiers by freezing a backbone and learning a zero-initialized change network that is added to the forward pass, with a loss $\mathcal{L}_{mc} = \mathcal{L}(f_{\theta + \theta'}(x), y) + \lambda \|\theta'\|$ to penalize parameter change. By applying either $\ell_1$, $\ell_2$, or a combination of both norms for $\|\theta'\|$, and employing an early stopping criterion based on correctly predicting a tuning batch with an additional minimum step delay $\epsilon$, the method achieves bias mitigation with very few tuning examples. Across four bias/domain-shift datasets (ISIC melanoma, CelebA hair color, Waterbirds, Camelyon17), the approach often yields improved unbiased-test performance for small tuning sets, while standard fine-tuning with early stopping can match or exceed gains for larger tuning sets. The results demonstrate a practical, data-efficient debiasing strategy that minimizes changes to the pre-trained parameters and can be integrated with existing baselines to reduce overfitting.
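The objective and stopping rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy logistic-regression classifier in place of a deep network, uses the $\ell_1$ norm for $\|\theta'\|$, and reads the stopping criterion as "stop once the tuning batch is classified correctly and at least $\epsilon$ steps have been taken". The function names and the values of `lam`, `lr`, and `eps` are hypothetical and chosen only for illustration.

```python
import math

def predict_prob(weights, x):
    """Sigmoid of the dot product; stands in for the classifier f."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def tune_with_change_penalty(theta, batch, lam=0.01, lr=0.5, eps=5, max_steps=1000):
    """Tune a zero-initialized change vector on top of frozen weights theta.

    Minimizes cross_entropy(f_{theta + delta}(x), y) + lam * ||delta||_1 by
    (sub)gradient descent on delta only. Stops once every example in `batch`
    is classified correctly and at least `eps` steps have been taken
    (an assumed reading of the early stopping criterion).
    """
    delta = [0.0] * len(theta)  # theta' starts at zero, so the model is unchanged
    for step in range(1, max_steps + 1):
        combined = [t + d for t, d in zip(theta, delta)]
        grad = [0.0] * len(theta)
        for x, y in batch:
            err = predict_prob(combined, x) - y  # d(cross-entropy)/d(logit)
            for i, xi in enumerate(x):
                grad[i] += err * xi / len(batch)
        for i, d in enumerate(delta):
            sign = (d > 0) - (d < 0)  # subgradient of the l1 penalty
            delta[i] = d - lr * (grad[i] + lam * sign)
        combined = [t + d for t, d in zip(theta, delta)]
        all_correct = all(
            (predict_prob(combined, x) >= 0.5) == bool(y) for x, y in batch
        )
        if all_correct and step >= eps:
            return delta, step
    return delta, max_steps

# Frozen weights that rely entirely on feature 0 (the assumed "bias" feature).
theta = [2.0, 0.0]
# A single tuning example that contradicts the bias: feature 0 is on, label is 0.
batch = [([1.0, -1.0], 0)]
delta, step = tune_with_change_penalty(theta, batch)
tuned = [t + d for t, d in zip(theta, delta)]
print(step, predict_prob(tuned, batch[0][0]) < 0.5)
```

Only `delta` is updated, so `theta` stays frozen and the penalty term directly discourages large deviations from the pre-trained weights. The minimum step delay keeps tuning from stopping the moment the batch is first classified correctly.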
Abstract
Deep models trained on large amounts of data often incorporate implicit biases present during training time. If such a bias is later discovered during inference or deployment, it is often necessary to acquire new data and retrain the model. This requirement is especially problematic in critical areas such as autonomous driving or medical decision-making. In these scenarios, new data is often expensive and hard to come by. In this work, we present a method based on change penalization that takes a pre-trained model and adapts the weights to mitigate a previously detected bias. We achieve this by tuning a zero-initialized copy of a frozen pre-trained network. Our method needs very few examples that contradict the bias, in extreme cases only a single one, to increase performance. Additionally, we propose an early stopping criterion to modify baselines and reduce overfitting. We evaluate our approach on a well-known bias in skin lesion classification and three other datasets from the domain shift literature. We find that our approach works especially well with very few images. Simple fine-tuning combined with our early stopping also leads to performance benefits for a larger number of tuning samples.
