Towards Utilising a Range of Neural Activations for Comprehending Representational Associations

Laura O'Mahony, Nikola S. Nikolov, David JP O'Sullivan

TL;DR

We develop a method that curates data from mid-range logit samples for retraining, mitigating spurious correlations (confounding concepts in the penultimate layer) on real benchmark datasets. Its success exemplifies the utility of inspecting non-maximal activations to extract complex relationships learned by models.

Abstract

Recent efforts to understand intermediate representations in deep neural networks have commonly attempted to label individual neurons, and combinations of neurons that make up linear directions in the latent space, by examining extremal neuron activations and the highest direction projections. In this paper, we show that this approach, although it yields a good approximation for many purposes, fails to capture valuable information about the behaviour of a representation. Neural network activations are generally dense, so a more complex but more realistic scenario is that linear directions encode information at various levels of stimulation. We hypothesise that non-extremal activations contain complex information worth investigating, such as statistical associations, and thus may be used to locate confounding human-interpretable concepts. We explore the value of studying a range of neuron activations by taking the case of mid-level output neuron activations, and we demonstrate on a synthetic dataset how they can inform us about aspects of representations in the penultimate layer that are not evident from analysing maximal activations alone. We use our findings to develop a method that curates data from mid-range logit samples for retraining to mitigate spurious correlations, or confounding concepts in the penultimate layer, on real benchmark datasets. The success of our method exemplifies the utility of inspecting non-maximal activations to extract complex relationships learned by models.
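As a rough illustration of what selecting mid-range logit samples could look like, the sketch below keeps the samples whose logit for a chosen output class falls between two quantiles. The function name `select_midrange`, the quantile bounds `lower_q` and `upper_q`, and the random logits are all assumptions made for this example; the abstract does not state how the mid-range is delimited.

```python
import numpy as np

def select_midrange(logits, class_idx, lower_q=0.4, upper_q=0.6):
    """Return indices of samples whose logit for `class_idx` lies
    between two quantiles (hypothetical definition of 'mid-range')."""
    class_logits = logits[:, class_idx]
    lo, hi = np.quantile(class_logits, [lower_q, upper_q])
    mask = (class_logits >= lo) & (class_logits <= hi)
    return np.flatnonzero(mask)

# Stand-in for the logits of a trained classifier (assumption: random data).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 2))

mid_idx = select_midrange(logits, class_idx=1)
# The maximally activating sample is excluded by construction.
top_idx = int(np.argmax(logits[:, 1]))
```

The indices in `mid_idx` would then be used to look up the corresponding penultimate-layer embeddings for further analysis, whereas an extremal-activation analysis would start from `top_idx` and its neighbours.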

Paper Structure

This paper contains 31 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: An illustration of the spurious-correlation analysis and automatic data selection for retraining, used to mitigate the spurious bias found in the CelebA dataset. 1. A model is trained using the standard empirical risk minimisation (ERM) approach, which typically absorbs spurious correlations. 2. We select an output class and order the logits by magnitude. Samples within the mid-level range, where the model is less confident in its prediction, may include many mislabelled examples, images with few spurious cues, and counterexamples to the spurious trend. We filter the data to keep only mid-range activating examples. 3. We cluster the corresponding embeddings from the penultimate layer. 4. In this way, less spurious data is selected to retrain the model, for example by simply fine-tuning the classification layer.
  • Figure 2: (a) Train and test classification accuracy on the DSprites dataset for classifiers trained with varying proportions of added bias (shown on the $x$-axis). (b) The representational similarity matrix for the penultimate layer at bias level 0.4, sorted by shape (left) and $x$ position (right).
  • Figure 3: (a) Maximally activating examples for the neuron corresponding to the square class for training bias level 0.4. (b) Mid-level activating images in the same setting.
  • Figure 4: UMAP projections for encoder embeddings of (a) all points and (b) middle logit points from the CelebA dataset.
  • Figure 5: Representational similarity matrices plotted for encoders trained with varying levels of bias (below each image). The order of encoder embeddings is sorted by (a) shape, and (b) the $x$ position of the shape.
  • ...and 6 more figures
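
The representational similarity matrices mentioned in the captions of Figures 2 and 5 can be approximated along the following lines. This is a minimal sketch that assumes cosine similarity between embeddings, which the captions do not specify; the function name `similarity_matrix` and the synthetic embeddings are hypothetical and stand in for real penultimate-layer outputs.

```python
import numpy as np

def similarity_matrix(embeddings, sort_key):
    """Pairwise cosine similarity of embeddings, with rows and columns
    ordered by `sort_key` (e.g. shape label or x position)."""
    order = np.argsort(sort_key, kind="stable")
    sorted_emb = embeddings[order]
    norms = np.linalg.norm(sorted_emb, axis=1, keepdims=True)
    unit = sorted_emb / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T

# Synthetic stand-in: embeddings that mostly encode a 3-way shape label.
rng = np.random.default_rng(0)
shape_labels = rng.integers(0, 3, size=60)
embeddings = np.eye(3)[shape_labels] + 0.05 * rng.normal(size=(60, 3))

rsm = similarity_matrix(embeddings, shape_labels)
```

When the embeddings encode the attribute used for sorting, the matrix shows bright blocks along the diagonal. Sorting the same embeddings by a different attribute, such as $x$ position, shows whether that attribute is also reflected in the representation.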