GRADIEND: Feature Learning within Neural Networks Exemplified through Biases

Jonathan Drechsel, Steffen Herbold

TL;DR

This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information, such as gender, race, and religion, and demonstrates its effectiveness across various model architectures.

Abstract

AI systems frequently exhibit and amplify social biases, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information, such as gender, race, and religion. We show that our method not only identifies which weights of a model need to be changed to modify a feature, but can also be used to rewrite models to debias them while maintaining other capabilities. We demonstrate the effectiveness of our approach across various model architectures and highlight its potential for broader applications.

Paper Structure

This paper contains 55 sections, 10 equations, 16 figures, 36 tables.

Figures (16)

  • Figure 1: GRADIEND -- Targeted learning of a single scalar feature neuron using orthogonal gradient inputs, shown with an example for gender bias.
  • Figure 2: Distribution of encoded values for all gender GRADIEND models across different datasets. The yellow dots indicate the expected label used for $\text{Cor}_\text{Enc}$.
  • Figure 3: Distribution of encoded values for different datasets of the $\text{BERT}_\text{base}$ models for race and religion. The yellow dots indicate the expected label used for $\text{Cor}_\text{Enc}$.
  • Figure 4: Metrics for changed models based on the $\text{BERT}_\text{base}$ gender GRADIEND with varying feature factor and learning rate. The cells with the best BPI, FPI, and MPI are highlighted across all subplots. All values are reported as percentages.
  • Figure 5: Distribution of encoded values for all race and religion GRADIEND models across different datasets. The yellow dots indicate the expected label used for $\text{Cor}_\text{Enc}$.
  • ...and 11 more figures