Concept Gradient: Concept-based Interpretation Without Linear Assumption

Andrew Bai; Chih-Kuan Yeh; Pradeep Ravikumar; Neil Y. C. Lin; Cho-Jui Hsieh

TL;DR

Concept Gradient (CG) extends concept-based interpretation beyond the linear assumption of Concept Activation Vector (CAV) by deriving a gradient-based sensitivity of the model output to concept functions $g: \mathbb{R}^d \to \mathbb{R}^m$ via $R_{\text{CG}}(x; f, g) = \nabla g(x)^{\dagger} \nabla f(x)$. CG unifies and generalizes CAV and GC by explicitly chaining gradients through the shared input space, recovering the derivative $h'(c)$ when a local inverse exists, and reducing to CAV in the linear case. Empirically, CG outperforms CAV on fine-grained image datasets (CUB, AwA2) in both local and global concept attribution and yields qualitatively coherent explanations across semantic levels, including a medical case study on mortality risk that aligns with literature. The work also provides practical guidance for implementing CG, including concept model training via finetuning, layer selection strategies, and normalization considerations, while acknowledging limitations related to differentiability and the need for representative concept data. Overall, CG offers a principled, non-linear, gradient-based framework for post-hoc concept explanations with demonstrated benefits for trust, debugging, and domain-specific decision support.
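The formula $R_{\text{CG}}(x; f, g) = \nabla g(x)^{\dagger} \nabla f(x)$ can be illustrated with a minimal NumPy sketch. Several things here are assumptions made for illustration and are not taken from the paper: the toy functions `f` and `g`, the helper names `numerical_gradient` and `concept_gradient`, the use of central finite differences in place of automatic differentiation, and the convention that gradients are stored as $d \times k$ and $d \times m$ matrices (one row per input dimension).

```python
import numpy as np


def numerical_gradient(fn, x, eps=1e-6):
    """Central-difference gradient of fn at x, shaped (d, out_dim).

    Row i holds the partial derivatives of every output with respect to x[i].
    This stands in for autodiff purely to keep the sketch dependency-free.
    """
    x = np.asarray(x, dtype=float)
    out_dim = np.atleast_1d(fn(x)).size
    grad = np.zeros((x.size, out_dim))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        upper = np.atleast_1d(fn(x + step))
        lower = np.atleast_1d(fn(x - step))
        grad[i] = (upper - lower) / (2 * eps)
    return grad


def concept_gradient(f, g, x):
    """Pseudo-inverse of the concept gradient times the model gradient."""
    grad_f = numerical_gradient(f, x)  # shape (d, k)
    grad_g = numerical_gradient(g, x)  # shape (d, m)
    return np.linalg.pinv(grad_g) @ grad_f  # shape (m, k)


# Hypothetical toy concept function: two concepts computed from three inputs.
def g(x):
    return np.array([x[0] ** 2 + x[1], x[2]])


# Hypothetical toy model whose output depends on the input only through g(x).
def f(x):
    c = g(x)
    return 3.0 * c[0] + np.sin(c[1])


x = np.array([1.0, 2.0, 0.5])
cg = concept_gradient(f, g, x)
print(cg.ravel())  # sensitivity of f to each of the two concepts
```

Because this toy `f` depends on `x` only through `g(x)` and $\nabla g(x)$ has full column rank at the chosen point, the result should match the derivative of `f` with respect to the concepts, here $(3, \cos 0.5)$, up to finite-difference error. For a single linear concept $g(x) = v^T x$, the pseudo-inverse of $v$ is $v^T / \lVert v \rVert^2$, so the same function returns $v^T \nabla f(x) / \lVert v \rVert^2$, a directional derivative along $v$ scaled by the squared norm.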

Abstract

Concept-based interpretations of black-box models are often more intuitive for humans to understand. The most widely adopted approach for concept-based interpretation is Concept Activation Vector (CAV). CAV relies on learning a linear relation between some latent representation of a given model and concepts. This linear separability is usually assumed implicitly but does not hold in general. In this work, we start from the original intent of concept-based interpretation and propose Concept Gradient (CG), which extends concept-based interpretation beyond linear concept functions. We show that for a general (potentially non-linear) concept, we can mathematically evaluate how a small change in the concept affects the model's prediction, which leads to an extension of gradient-based interpretation to the concept space. We demonstrate empirically that CG outperforms CAV on both toy examples and real-world datasets.

Paper Structure (32 sections, 2 theorems, 38 equations, 7 figures, 10 tables)

Introduction
Preliminaries
Problem definition
Recap of Concept Activation Vector (CAV)
GC: extending CAV to non-linear concepts
Proposed method
Definition of Concept Gradients (CG)
Implementation of CG
Selecting layer for attribution
Connections between CAV, GC, and CG
Experimental Results
Quantitative analysis
Qualitative analysis
Case study on mortality risk of myocardial infarction complications
Related work
...and 17 more sections

Key Result

Theorem 1

Consider a particular point $\hat{x}$ with $\hat{c}=g(\hat{x})$. Let $h: \mathbb{R}^m\rightarrow \mathbb{R}^d$ be a smooth and differentiable function that maps $c$ to $x$ and satisfies $g(h(c))=c$ locally within the $\epsilon$-ball around $\hat{c}$. Then the gradient of $h$ takes the form $\nabla h(\hat{c}) = \nabla g(\hat{x})^{\dagger} + G_\perp$, where any row vector of $G_\perp$ belongs to $\text{null}(\nabla g(\hat{x})^T)$ (null space of $\nabla g(\hat{x})

Figures (7)

Figure 1: Comparison of feature-based interpretation heatmap (left: Integrated Gradients) and concept-based importance score (right: Concept Gradients) for the model prediction of "Black footed Albatross". Attribution to high-level concepts is more informative to humans than raw pixels.
Figure 2: CUB concept recalls for different input representations in various layers and architectures (left to right, deep to shallow layers). CG consistently performs better than CAV locally and globally.
Figure 3: Visualization of instances with highest CG attributed importance (AwA2 validation set) for each concept (top 1 instance in the top 3 classes per concept). CG is capable of handling low level (colors), middle level (textures), and high level (body components) concepts simultaneously.
Figure 4: Concept prediction accuracy and concept attribution recall when finetuning starting from different layers of the model. Finetuning more layers leads to higher concept prediction accuracy.
Figure 5: Variance of gradients finetuning starting from different layers. The variance is higher when finetuning starting from earlier layers.
...and 2 more figures

Theorems & Definitions (3)

Theorem 1
Theorem 2
proof
