Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models

Goutham Rajendran; Simon Buchholz; Bryon Aragam; Bernhard Schölkopf; Pradeep Ravikumar

TL;DR

The paper addresses how to learn human-interpretable concepts from complex data by unifying causal representation learning (CRL) with foundation-model interpretability. It defines concepts as affine subspaces of a latent representation space and proves that a subset of concepts is identifiable from a small number of environments: $m = n+1$ concept-conditioned datasets (i.e., $n+2$ environments) suffice to recover $n$ atomic concepts up to linear transformations. The authors validate the theory with end-to-end contrastive learning on synthetic data and extend the framework to large language model alignment, introducing steering matrices to guide truthfulness in Inference-Time Intervention (ITI) and reporting improvements on TruthfulQA with LLaMA. The work provides a principled partial-identifiability framework for interpretable representations of high-dimensional data and offers practical mechanisms for controllable generation and mechanistic interpretability of foundation models.
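To make the "concepts as affine subspaces" idea concrete, here is a minimal sketch. It assumes a concept is encoded by a linear map `A` and a target value `b`, so that a latent vector `z` carries the concept when `A @ z == b`. The names `satisfies_concept`, `A`, `b`, and the 3-dimensional latent space are illustrative assumptions, not notation taken from this summary.

```python
import numpy as np

def satisfies_concept(z, A, b, tol=1e-8):
    """Return True if z lies in the affine subspace {z : A z = b} (up to tol).

    A and b are hypothetical: a linear map and a target value that together
    describe one concept in latent space.
    """
    return bool(np.linalg.norm(A @ z - b) <= tol)

# Hypothetical example: the concept constrains only the first latent coordinate.
A = np.array([[1.0, 0.0, 0.0]])
b = np.array([2.0])

z_in = np.array([2.0, -1.0, 5.0])   # first coordinate equals 2, so A z = b
z_out = np.array([0.0, -1.0, 5.0])  # first coordinate differs, so A z != b

in_concept = satisfies_concept(z_in, A, b)
out_concept = satisfies_concept(z_out, A, b)
```

The remaining coordinates are left free, which is what makes the set an affine subspace and not a single point.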

Abstract

To build intelligent machine learning systems, there are two broad approaches. One approach is to build inherently interpretable models, as endeavored by the growing field of causal representation learning. The other approach is to build highly-performant foundation models and then invest efforts into understanding how they work. In this work, we relate these two approaches and study how to learn human-interpretable concepts from data. Weaving together ideas from both fields, we formally define a notion of concepts and show that they can be provably recovered from diverse data. Experiments on synthetic data and large language models show the utility of our unified approach.

Paper Structure (45 sections, 9 theorems, 95 equations, 1 figure, 2 tables, 1 algorithm)

Introduction
Related work
Causal representation learning
Linearity of representations
Concepts from pre-trained models
Setup
Generative model
Data distributions
Main Result
Experiments
End-to-end Contrastive learning algorithm
Sampling from concept conditional distributions
Synthetic experiments
Alignment of Large Language Models
Conclusion
...and 30 more sections

Key Result

Lemma 1

The diversity assumption (ass:div) is satisfied almost surely if there are $n+1$ concept conditional distributions such that every $n$ rows of the environment-concept matrix are linearly independent and the $b^e$ are drawn independently from a continuous distribution.
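The matrix condition in Lemma 1 can be checked numerically. The sketch below assumes the environment-concept matrix is given as an array `M` with one row per concept conditional distribution and one column per atomic concept (so $n+1$ rows and $n$ columns in the lemma's setting). The function name `every_n_rows_independent` and this layout are illustrative assumptions. The sketch covers only the linear-independence condition, not the requirement that the $b^e$ be drawn from a continuous distribution.

```python
from itertools import combinations

import numpy as np

def every_n_rows_independent(M):
    """Check that every choice of n rows of M (n = number of columns) has rank n."""
    M = np.asarray(M, dtype=float)
    num_rows, n = M.shape
    if num_rows < n:
        return False
    return all(
        np.linalg.matrix_rank(M[list(rows)]) == n
        for rows in combinations(range(num_rows), n)
    )

# Hypothetical example with n = 3 atomic concepts and n + 1 = 4 rows.
rng = np.random.default_rng(0)
M_generic = rng.standard_normal((4, 3))  # generic rows are independent

M_degenerate = M_generic.copy()
M_degenerate[1] = 2.0 * M_degenerate[0]  # two parallel rows break the condition

generic_ok = every_n_rows_independent(M_generic)
degenerate_ok = every_n_rows_independent(M_degenerate)
```

Enumerating all row subsets is cheap here because an $(n+1) \times n$ matrix has only $n+1$ subsets of size $n$.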

Figures (1)

Figure 1: Illustration of our framework

Theorems & Definitions (24)

Definition 1: Concepts
Definition 2: Atoms
Definition 3: Concept conditional distribution
Definition 4: Identifiability
Remark 1
Lemma 1
Theorem 1
Remark 2
Lemma 2
Theorem 2
...and 14 more
