Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

Roland S. Zimmermann, Thomas Klein, Wieland Brendel

TL;DR

The paper investigates whether scaling vision models in size and data improves mechanistic interpretability at the level of individual units. Using a large-scale psychophysical 2-AFC protocol across nine diverse architectures and two interpretability methods (natural exemplars and synthetic feature visualizations), the authors find no meaningful gains in interpretability from scaling, and in some cases observe lower interpretability for modern models compared to GoogLeNet. They introduce the IMI dataset, consisting of over 130,000 human responses across 767 units, to enable automated, human-aligned interpretability measures and future optimization. The findings argue that interpretability must be explicitly designed into model architectures and training objectives, rather than emerging as a byproduct of scale, and provide a resource to accelerate the development of automated interpretability tools with broad practical impact.

Abstract

In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. We here ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We use a psychophysical paradigm to quantify one form of mechanistic interpretability for a diverse suite of nine models and find no scaling effect for interpretability - neither for model nor dataset size. Specifically, none of the investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago. Latest-generation vision models appear even less interpretable than older architectures, hinting at a regression rather than improvement, with modern models sacrificing interpretability for accuracy. These results highlight the need for models explicitly designed to be mechanistically interpretable and the need for more helpful interpretability methods to increase our understanding of networks at an atomic level. We release a dataset containing more than 130'000 human responses from our psychophysical evaluation of 767 units across nine models. This dataset facilitates research on automated instead of human-based interpretability evaluations, which can ultimately be leveraged to directly optimize the mechanistic interpretability of models.

Paper Structure (33 sections, 22 figures)

Figures (22)

  • Figure 1: Has scaling models in terms of their dataset and model size improved interpretability? A. We perform a large-scale psychophysics experiment to investigate the interpretability of nine networks through the two most-used mechanistic interpretability methods. B. We see that scaling has not led to increased interpretability. Therefore, we argue that one has to explicitly optimize models to be interpretable. C. We expect our dataset to enable building automated measures for quantifying the interpretability of models and, thus, bootstrap the development of more interpretable models.
  • Figure 2: Left. Model size and training schemes have little influence on per-unit mechanistic interpretability. We compare the mechanistic interpretability of the units of nine vision models for two interpretability methods: maximally activating dataset samples (Natural) and feature visualizations (Synthetic). In a large-scale psychophysical experiment, we compare models that differ in architecture, training objectives, and training data. While these models reflect the advancements in model design in recent years (sorted by model size first and then dataset size), we surprisingly see little to no effect of these design choices on mechanistic, per-unit interpretability. While these results might appear promising, as all models yield scores of about $80\%$ (natural), we demonstrate in \ref{sec:results_difficulty} that interpretability is far more limited than it first appears and breaks down dramatically as the task is made harder. Also, note that error bars represent confidence intervals around the estimated means, not the variance of the underlying data (see also \ref{sec:results_imi}). Right. Few models have significantly different interpretability scores. The differences across models in interpretability afforded by natural exemplars are mostly non-significant (NS) in a Conover test with Holm correction for multiple comparisons; see \ref{fig:appx_model_comparison_significance_optimized} for significance values for synthetic feature visualizations.
  • Figure 3: Neither higher classification performance nor more human-like decisions come with higher interpretability. Left. While the investigated models have strongly varying classification performance, as measured by the ImageNet validation accuracy, their interpretability shows less variation for both natural exemplars (orange) and synthetic feature visualizations (blue). More accurate classifiers are not necessarily more interpretable. For synthetic feature visualizations, there might even be a regression of interpretability with increasing accuracy. Right. A similar result is obtained when quantifying how similar the models are to human behavior. This similarity is measured by the mean rank statistic of the model-vs-human benchmark \cite{geirhos2021partial}, with a lower rank meaning that the model is more human-like.
  • Figure 4: The position of a layer is sometimes predictive of its interpretability. We investigate the interpretability afforded by natural exemplars as measured in our psychophysical experiment by visualizing it for different units of various layers for all investigated networks as a function of their relative position within the network. Here, the first layer corresponds to a relative position of $0$, whereas the last layer has a position of $1$. The table shows Spearman's rank correlation between the proportion correct (averaged over multiple units from the same layer) and the layer position. Asterisks denote significant correlations using the thresholds shown in \ref{fig:model_comparison}.
  • Figure 5: Well-interpretable units do not necessarily stay interpretable in harder tasks. We visualize the human performance for each investigated unit of the (Clip) ResNet-50 for the easy (black), medium (blue), and hard (orange) tasks in the natural condition. The units are ordered by the recorded proportion correct values in the easy task. As expected, the performance for almost all units decreases with increasing hardness. However, the size of the performance drop is not strongly correlated with performance in the easy task, i.e., well-interpretable units in the easy condition do not necessarily stay well-interpretable in the harder task. For an alternative visualization that displays the gap between the difficulty levels separately, see \ref{fig:model_comparison_units_rn_easy_vs_hard_gaps}.
  • ...and 17 more figures
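
The layer-position analysis described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the input format of `(layer_index, proportion_correct)` pairs, and the mapping of layer index `i` to relative position `i / (num_layers - 1)` are assumptions chosen to match the caption's description (first layer at $0$, last layer at $1$). The toy scores are made up and are not data from the paper.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def layer_position_correlation(unit_scores, num_layers):
    """Correlate per-layer mean proportion correct with relative layer position.

    unit_scores: iterable of (layer_index, proportion_correct), one per unit
    (hypothetical input format, assumed for this sketch).
    """
    per_layer = defaultdict(list)
    for layer_index, proportion_correct in unit_scores:
        per_layer[layer_index].append(proportion_correct)
    layers = sorted(per_layer)
    # Assumed mapping: first layer -> 0, last layer -> 1.
    positions = [i / (num_layers - 1) for i in layers]
    # Average over the units that belong to the same layer.
    means = [sum(per_layer[i]) / len(per_layer[i]) for i in layers]
    return spearman(positions, means)


# Made-up scores for a hypothetical 5-layer network (not data from the paper).
toy_scores = [(0, 0.90), (0, 0.80), (1, 0.80), (2, 0.75),
              (3, 0.70), (4, 0.60), (4, 0.70)]
rho = layer_position_correlation(toy_scores, num_layers=5)
print(f"Spearman rho = {rho:.2f}")
```

The sketch returns only the correlation coefficient; the significance markers mentioned in the caption would additionally require a p-value, which a statistics library would normally supply. It also assumes at least two layers and non-constant scores, since otherwise the denominator is zero.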