Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah D. Goodman

TL;DR

The paper introduces Distributed Alignment Search (DAS), a gradient-based method for aligning interpretable high-level causal variables with distributed neural representations. It addresses two limitations of earlier causal abstraction methods: brute-force search over alignments, and the assumption that high-level variables map to disjoint (localist) sets of neurons. DAS performs distributed interchange interventions in non-standard bases and learns the rotation matrix that defines the basis, which lets it find faithful causal abstractions between neural networks and symbolic models on relational and lexical tasks. On Hierarchical Equality and Monotonicity NLI, DAS achieves perfect or near-perfect interchange-intervention accuracy, and it helps reveal whether representations encode abstract relations or lower-level information such as word identities. The work presents DAS as a scalable tool for causal analysis of deep networks and highlights the nuanced substructure of neural representations, with implications for explainability and for understanding how symbolic and connectionist accounts can coexist.

Abstract

Causal abstraction is a promising theoretical framework for explainable artificial intelligence that defines when an interpretable high-level causal model is a faithful simplification of a low-level deep learning system. However, existing causal abstraction methods have two major limitations: they require a brute-force search over alignments between the high-level model and the low-level one, and they presuppose that variables in the high-level model will align with disjoint sets of neurons in the low-level one. In this paper, we present distributed alignment search (DAS), which overcomes these limitations. In DAS, we find the alignment between high-level and low-level models using gradient descent rather than conducting a brute-force search, and we allow individual neurons to play multiple distinct roles by analyzing representations in non-standard bases (distributed representations). Our experiments show that DAS can discover internal structure that prior approaches miss. Overall, DAS removes previous obstacles to conducting causal abstraction analyses and allows us to find conceptual structure in trained neural nets.

Paper Structure

This paper contains 34 sections, 10 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: A generic multi-source distributed interchange intervention. The base input and two source inputs create three total settings of a model. The top left (green) and right (blue) total model settings are determined by two source inputs and the middle total model setting (red) is determined by the base input. Three hidden units from each total setting are rotated with an orthogonal matrix $\mathbf{R}:\mathbf{X}\to\mathbf{Y}$. Then we intervene on the rotated representation for the base input and fix two dimensions to be the value they take on for each source input, respectively. Then we unrotate the representation with $\mathbf{R}^{-1}$ and compute a counterfactual total model setting for the base input. In DAS, the orthogonal matrix is found with gradient descent using a high-level causal model to guide the search process.
  • Figure 2: A causal model that computes the hierarchical equality task.
  • Figure 3: DAS on a random network with a 16-dimensional input. An oversized hidden dimension allows DAS to manipulate the model's behavior by searching through a large space of random mechanisms.
  • Figure 4: Monotonicity NLI task examples and high-level model.
  • Figure 5: Finding Localist Alignment Matrix
  • ...and 7 more figures
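
The intervention described in the Figure 1 caption can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names `random_orthogonal` and `distributed_interchange` are hypothetical, the hidden vectors are made-up numbers, and the orthogonal matrix is drawn at random here, whereas in DAS it is found with gradient descent.

```python
import numpy as np

def random_orthogonal(n, seed=0):
    """Hypothetical helper: a random orthogonal matrix from a QR decomposition.
    In DAS the matrix is learned; a random one stands in for it here."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def distributed_interchange(base, sources, dims, R):
    """Rotate the base hidden vector with R, overwrite the chosen rotated
    dimensions with the values they take for each source, then unrotate.

    base    : (n,) hidden vector computed from the base input
    sources : list of (n,) hidden vectors, one per source input
    dims    : list of index lists (assumed disjoint), one per source
    R       : (n, n) orthogonal matrix, so its inverse is its transpose
    """
    rotated = R @ base                    # base representation in the rotated basis
    for src, idx in zip(sources, dims):
        rotated[idx] = (R @ src)[idx]     # fix these coordinates to the source's values
    return R.T @ rotated                  # unrotate with R^{-1} = R^T

# Three hidden units, one base input and two source inputs, as in Figure 1.
base = np.array([1.0, 2.0, 3.0])
s1 = np.array([10.0, 20.0, 30.0])
s2 = np.array([100.0, 200.0, 300.0])

R = random_orthogonal(3, seed=1)
counterfactual = distributed_interchange(base, [s1, s2], [[0], [1]], R)

# With the identity matrix the same call reduces to an ordinary (localist)
# interchange intervention that swaps individual neurons.
localist = distributed_interchange(base, [s1, s2], [[0], [1]], np.eye(3))
```

The resulting `counterfactual` vector would then be written back into the network in place of the base hidden state, and the forward pass continued from there to obtain the counterfactual output.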

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5