Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability

Tuomas Oikarinen, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng

TL;DR

The paper tackles the costly and noisy process of crowdsourced evaluation for automated neuron explanations in vision models, proposing two techniques that make such evaluation cheaper and more reliable: Model-Guided Importance Sampling (MG-IS) and Bayesian Rating Aggregation (BRAgg). MG-IS selects informative inputs for human labeling by guiding sampling with model-predicted concept presence, while BRAgg aggregates noisy human ratings using Bayesian reasoning with priors derived from cheap evaluators such as SigLIP. Together, MG-IS and BRAgg reduce evaluation costs by about 40× and enable large-scale comparisons of interpretability methods across architectures such as ResNet-50 and ViT-B-16, revealing Linear Explanations (LE) as particularly effective. The study demonstrates that a hybrid human-model evaluation approach can rival fully automated assessments in accuracy while remaining practical and cost-effective for mechanistic interpretability research.

Abstract

Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are, and which methods produce the most accurate descriptions. While crowdsourced evaluations are commonly used, existing pipelines are noisy, costly, and typically assess only the highest-activating inputs, leading to unreliable results. In this paper, we introduce two techniques to enable cost-effective and accurate crowdsourced evaluation of automated interpretability methods beyond top-activating inputs. First, we propose Model-Guided Importance Sampling (MG-IS) to select the most informative inputs to show human raters. In our experiments, we show this reduces the number of inputs needed to reach the same evaluation accuracy by ~13x. Second, we address label noise in crowdsourced ratings through Bayesian Rating Aggregation (BRAgg), which allows us to reduce the number of ratings per input required to overcome noise by ~3x. Together, these techniques reduce the evaluation cost by ~40x, making large-scale evaluation feasible. Finally, we use our methods to conduct a large-scale crowdsourced study comparing recent automated interpretability methods for vision networks.
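To make the idea of Bayesian rating aggregation concrete, here is a minimal sketch of one way noisy binary ratings could be combined with a prior from a cheap evaluator. It is not the paper's exact formulation: the function name `posterior_concept_present`, the assumption that raters err independently and symmetrically with a known rate `eta`, and the example numbers are all illustrative assumptions.

```python
import math

def posterior_concept_present(prior, ratings, eta):
    """Posterior probability that the concept is present in an input.

    prior   -- P(concept present) before seeing ratings, e.g. a score from
               a cheap automated evaluator (assumption: already a probability)
    ratings -- iterable of booleans, True meaning "rater says present"
    eta     -- assumed probability that a single rater gives the wrong label
    """
    # Clamp the prior so the log-odds below stay finite.
    prior = min(max(prior, 1e-6), 1.0 - 1e-6)
    log_odds = math.log(prior) - math.log(1.0 - prior)
    # Each rating shifts the log-odds by a fixed step under symmetric noise.
    step = math.log(1.0 - eta) - math.log(eta)
    for says_present in ratings:
        log_odds += step if says_present else -step
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two raters disagree, so a majority vote is tied; the prior breaks the tie.
tied = posterior_concept_present(prior=0.9, ratings=[True, False], eta=0.23)
# A single positive rating with an uninformative prior.
single = posterior_concept_present(prior=0.5, ratings=[True], eta=0.23)
```

Under these assumptions, conflicting ratings cancel and the posterior falls back to the prior, which hints at why fewer ratings per input may suffice when an informative prior is available.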

Paper Structure

This paper contains 44 sections, 1 theorem, 34 equations, 17 figures, 7 tables.

Key Result

Theorem 1

For importance sampling with sampling distribution $q$, used to estimate $\mathbb{E}_{p}[h(x)]$ under a target distribution $p$, the choice of $q$ that minimizes the variance of the estimator satisfies $q(x) \propto |h(x)|\,p(x)$.
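The effect of the theorem can be checked numerically. The following is a small sketch on a synthetic discrete problem, not the paper's MG-IS pipeline: the integrand `h`, the uniform target `p`, and the noisy `proxy` standing in for a model's guess of $|h(x)|$ are all made-up assumptions for illustration.

```python
import numpy as np

def importance_estimate(h, p, q, n, rng):
    """Estimate sum_x h(x) p(x) from n indices drawn from q."""
    idx = rng.choice(len(p), size=n, p=q)
    # Reweight each draw by p/q so the estimator stays unbiased.
    return float(np.mean(h[idx] * p[idx] / q[idx]))

def normalize(w):
    """Scale non-negative weights so they sum to one."""
    return w / w.sum()

rng = np.random.default_rng(0)
N = 1000
p = np.full(N, 1.0 / N)                   # uniform target distribution
h = np.zeros(N)
h[:20] = rng.uniform(1.0, 2.0, size=20)   # sparse, non-negative integrand
true_value = float(np.sum(h * p))

q_uniform = p.copy()                      # plain Monte Carlo baseline
q_optimal = normalize(np.abs(h) * p)      # q(x) proportional to |h(x)| p(x)
# Hypothetical noisy guess of |h|, with a floor so every x can be sampled.
proxy = np.abs(h) * rng.uniform(0.5, 1.5, size=N) + 0.05
q_proxy = normalize(proxy * p)

def spread(q, trials=200, n=50):
    """Standard deviation of the estimate over repeated runs."""
    estimates = [importance_estimate(h, p, q, n, rng) for _ in range(trials)]
    return float(np.std(estimates))

std_uniform = spread(q_uniform)
std_proxy = spread(q_proxy)
std_optimal = spread(q_optimal)
```

Because `h` is non-negative here, sampling from the optimal `q` makes every weighted draw equal to the true value, so `std_optimal` is zero up to floating-point error. The optimal `q` requires knowing `h` in advance, so a practical scheme has to rely on an approximation; in this toy setup the noisy proxy still gives a noticeably smaller spread than uniform sampling.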

Figures (17)

  • Figure 1: Overview of the explanation evaluation pipeline. We focus on two main challenges: (1) how to reduce the high labeling cost, and (2) how to effectively handle noisy ratings. Our proposed solutions are discussed in Section \ref{sec:method}, validated in Section \ref{sec:validation}, and applied to compare automated interpretability methods in Section \ref{sec:large_scale_exp}.
  • Figure 2: Comparing different sampling strategies to estimate $\rho_{\mathcal{S}}$. Our Model-Guided Importance Sampling (MG-IS) using SigLIP estimates significantly outperforms the baseline regardless of the size of the sampled subset.
  • Figure 3: Comparison of different rating aggregation strategies on a simulated human study with error rate $\eta=23\%$, tested on the full dataset.
  • Figure 4: Comparing the effect of our sampling technique MG-IS and our rating aggregation method BRAgg, each applied independently and both together. The shaded region represents RCE $\leq 25\%$.
  • Figure 5: Results of our MTurk study, using BRAgg(SigLIP) for rating aggregation. All methods are restricted to length-1 explanations. Overall, explanations from LE(SigLIP) have the highest correlation coefficient.
  • ...and 12 more figures

Theorems & Definitions (1)

  • Theorem 1: montecarlobook, Sec 3.3.2, Theorem 3.12