Concept Gradient: Concept-based Interpretation Without Linear Assumption
Andrew Bai, Chih-Kuan Yeh, Pradeep Ravikumar, Neil Y. C. Lin, Cho-Jui Hsieh
TL;DR
Concept Gradient (CG) extends concept-based interpretation beyond the linear assumption of Concept Activation Vector (CAV). It derives a gradient-based sensitivity of the model output $f$ to a concept function $g: \mathbb{R}^d \to \mathbb{R}^m$, defined as $R_{\text{CG}}(x; f, g) = \nabla g(x)^{\dagger} \nabla f(x)$, where $\dagger$ denotes the pseudo-inverse. CG generalizes CAV by explicitly chaining gradients through the shared input space: it recovers the derivative $h'(c)$ when a local inverse exists and reduces to CAV in the linear case. Empirically, CG outperforms CAV on fine-grained image datasets (CUB, AwA2) in both local and global concept attribution, and it yields qualitatively coherent explanations across semantic levels, including a medical case study on mortality risk whose findings align with the literature. The work also provides practical guidance for implementing CG, covering concept model training via finetuning, layer selection strategies, and normalization considerations, and it acknowledges limitations related to differentiability and the need for representative concept data. Overall, CG offers a principled, non-linear, gradient-based framework for post-hoc concept explanations, with demonstrated benefits for trust, debugging, and domain-specific decision support.
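To make the formula $\nabla g(x)^{\dagger} \nabla f(x)$ concrete, the following is a minimal NumPy sketch on a toy problem, not the authors' implementation. It assumes gradients are stored with shape (input dimension, output dimension) and estimates them by central finite differences, whereas a real model would presumably use automatic differentiation. The helper names `numerical_gradient` and `concept_gradient`, and the toy functions `g`, `h`, and `f`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def numerical_gradient(fn, x, eps=1e-6):
    """Central-difference gradient of fn at x, with shape (d, out_dim)."""
    x = np.asarray(x, dtype=float)
    out_dim = np.atleast_1d(fn(x)).shape[0]
    grad = np.zeros((x.size, out_dim))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        diff = np.atleast_1d(fn(x + step)) - np.atleast_1d(fn(x - step))
        grad[i] = diff / (2 * eps)
    return grad


def concept_gradient(f, g, x):
    """Sketch of R_CG(x; f, g) = pinv(grad g(x)) @ grad f(x)."""
    grad_f = numerical_gradient(f, x)        # shape (d, k)
    grad_g = numerical_gradient(g, x)        # shape (d, m)
    return np.linalg.pinv(grad_g) @ grad_f   # shape (m, k)


# Toy setup (an assumption for illustration): the model factors through
# the concepts as f(x) = h(g(x)), with a non-linear concept function g.
def g(x):
    return np.array([x[0] * x[1], x[2] ** 2])


def h(c):
    return c[0] ** 2 + 3.0 * c[1]


def f(x):
    return h(g(x))


x0 = np.array([1.0, 2.0, 3.0])
cg = concept_gradient(f, g, x0)
```

In this toy setup the concept values at `x0` are `g(x0) = [2, 9]`, so the true gradient of `h` with respect to the concepts is `[2 * 2, 3] = [4, 3]`. Because the gradient of `g` has full column rank at `x0`, `cg` should match that vector up to finite-difference error, which illustrates the claim that CG recovers the derivative of the model with respect to the concept.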
Abstract
Concept-based interpretations of black-box models are often more intuitive for humans to understand. The most widely adopted approach for concept-based interpretation is Concept Activation Vector (CAV). CAV relies on learning a linear relation between some latent representation of a given model and concepts. Linear separability is usually assumed implicitly but does not hold in general. In this work, we start from the original intent of concept-based interpretation and propose Concept Gradient (CG), which extends concept-based interpretation beyond linear concept functions. We show that, for a general (potentially non-linear) concept, we can mathematically evaluate how a small change in the concept affects the model's prediction, which extends gradient-based interpretation to the concept space. We demonstrate empirically that CG outperforms CAV on both toy examples and real-world datasets.
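As a worked illustration of how a small change in the concept can be related to the model's prediction, consider the scalar case. Assume, purely for simplicity, that the model factors through the concept as $f(x) = h(g(x))$ with $x, c \in \mathbb{R}$ and $g'(x) \neq 0$; this factorization is an assumption of the sketch, not a condition stated in the abstract. The chain rule then gives

$$f'(x) = h'(g(x))\, g'(x) \quad\Longrightarrow\quad h'(c)\,\big|_{c = g(x)} = \frac{f'(x)}{g'(x)}.$$

For a vector input and a linear concept $g(x) = v^{\top} x + b$ (hypothetical notation, with $v$ playing the role of a concept vector), $\nabla g(x) = v$ and the pseudo-inverse of the column vector $v$ is $v^{\top} / \lVert v \rVert^2$, so

$$\nabla g(x)^{\dagger}\, \nabla f(x) = \frac{v^{\top} \nabla f(x)}{\lVert v \rVert^2}.$$

Under these assumptions, the result is the directional derivative of $f$ along $v$ scaled by $1 / \lVert v \rVert^2$, which is one way to read the statement that CG reduces to a CAV-style sensitivity when the concept is linear.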
