Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations

Eren Erogullari, Sebastian Lapuschkin, Wojciech Samek, Frederik Pahde

TL;DR

This work tackles entanglement among multiple Concept Activation Vectors (CAVs) by introducing a post-hoc orthogonalization framework that adds a non-orthogonality penalty to the standard CAV objective. The proposed $\mathcal{L}_{\text{orth}}$, optionally weighted as $\mathcal{L}_{\text{orth}}^{\beta}$, promotes orthogonality among concept directions while preserving directional correctness, enabling isolated concept manipulation in activation steering. Across CelebA and the synthetic FunnyBirds dataset with VGG16 and ResNet18, the method yields near-perfect disentanglement (high $\bar O$) with minimal AUROC loss, and qualitative heatmaps confirm improved concept isolation. Applications include precise concept insertion and targeted removal in diffusion-based generative models, reducing collateral damage compared to baseline CAVs, with implications for interpretable and robust concept-based explanations.

Abstract

Concept Activation Vectors (CAVs) are widely used to model human-understandable concepts as directions within the latent space of neural networks. They are trained by identifying directions from the activations of concept samples to those of non-concept samples. However, this method often produces similar, non-orthogonal directions for correlated concepts, such as "beard" and "necktie" within the CelebA dataset, which frequently co-occur in images of men. This entanglement complicates the interpretation of concepts in isolation and can lead to undesired effects in CAV applications, such as activation steering. To address this issue, we introduce a post-hoc concept disentanglement method that employs a non-orthogonality loss, facilitating the identification of orthogonal concept directions while preserving directional correctness. We evaluate our approach with real-world and controlled correlated concepts in CelebA and a synthetic FunnyBirds dataset with VGG16 and ResNet18 architectures. We further demonstrate the superiority of orthogonalized concept representations in activation steering tasks, allowing (1) the insertion of isolated concepts into input images through generative models and (2) the removal of concepts for effective shortcut suppression with reduced impact on correlated concepts in comparison to baseline CAVs.
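The entanglement described in the abstract can be made concrete with a small numerical sketch. The block below computes pairwise cosine similarities between concept directions and a simple non-orthogonality penalty. The function names and the penalty's exact form (mean squared off-diagonal cosine similarity) are assumptions chosen for illustration, not the paper's definition of $\mathcal{L}_{\text{orth}}$, and the toy vectors are made up.

```python
import numpy as np


def cosine_similarity_matrix(cavs: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between the rows of `cavs` (n_concepts x dim)."""
    unit = cavs / np.linalg.norm(cavs, axis=1, keepdims=True)
    return unit @ unit.T


def non_orthogonality_penalty(cavs: np.ndarray) -> float:
    """Mean squared off-diagonal cosine similarity.

    Assumed illustrative form: it is 0 when all directions are mutually
    orthogonal and grows as directions align. Not the paper's exact loss.
    """
    sim = cosine_similarity_matrix(cavs)
    off_diagonal = sim[~np.eye(sim.shape[0], dtype=bool)]
    return float(np.mean(off_diagonal ** 2))


# Hypothetical toy directions: two entangled concepts vs. two orthogonal ones.
entangled = np.array([[1.0, 0.0, 0.0],
                      [1.0, 1.0, 0.0]])
orthogonal = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])

print(non_orthogonality_penalty(entangled))   # larger for entangled directions
print(non_orthogonality_penalty(orthogonal))  # zero for orthogonal directions
```

In a training setting, a term of this kind would be added to the usual CAV objective so that the directions stay predictive of their concepts while being pushed apart. How the two terms are balanced is left open in this sketch.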

Paper Structure

This paper contains 24 sections, 8 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Left: Our novel objective encourages the orthogonalization of multiple concept directions trained simultaneously. Right: The resulting disentangled CAVs are beneficial for various applications, as concepts can be targeted in isolation. For example, when inserting the "necktie" concept into an input image in a steering task, entangled baseline CAVs might add correlated concepts as well (e.g., "mustache"), while disentangled CAVs add the targeted concept in isolation.
  • Figure 2: Left: Correlations of known concepts based on their co-occurrence in CelebA. Right: Pair-wise cosine similarities between concept representations via CAVs trained in isolation. Concepts frequently co-occurring in the training data (e.g., "high cheekbones", "smiling", and "mouth slightly open") result in highly similar and entangled CAVs.
  • Figure 3: Evolution of AUROC (blue) and average orthogonality $\bar{O}$ (red) during optimization for ResNet18 and VGG16 models trained on CelebA (left) and FunnyBirds (right). Our approach achieves near-perfect orthogonalization, while preserving directional correctness as measured via AUROC.
  • Figure 4: Left: Distribution of the relative change before and after optimization for the VGG16 model on CelebA. Middle and Right: Evolution of the per-concept metrics, orthogonality $O_i$ and AUROC. We highlight concepts with the highest increase (green), the highest decrease (red), and the smallest change (blue).
  • Figure 5: Left and Middle: Cosine similarity matrices of CAVs trained on the VGG16 model on the CelebA dataset, before and after fine-tuning. Right: Kernel Density Estimation of relative changes of individual concepts in two entangled blocks of female- (red) and male-associated (blue) concepts.
  • ...and 7 more figures