SCoCCA: Multi-modal Sparse Concept Decomposition via Canonical Correlation Analysis

Ehud Gordon, Meir Yossef Levi, Guy Gilboa

Abstract

Interpreting the internal reasoning of vision-language models is essential for deploying AI in safety-critical domains. Concept-based explainability provides a human-aligned lens by representing a model's behavior through semantically meaningful components. However, existing methods are largely restricted to images and overlook the cross-modal interactions. Text-image embeddings, such as those produced by CLIP, suffer from a modality gap, where visual and textual features follow distinct distributions, limiting interpretability. Canonical Correlation Analysis (CCA) offers a principled way to align features from different distributions, but has not been leveraged for multi-modal concept-level analysis. We show that the objectives of CCA and InfoNCE are closely related, such that optimizing CCA implicitly optimizes InfoNCE, providing a simple, training-free mechanism to enhance cross-modal alignment without affecting the pre-trained InfoNCE objective. Motivated by this observation, we couple concept-based explainability with CCA, introducing Concept CCA (CoCCA), a framework that aligns cross-modal embeddings while enabling interpretable concept decomposition. We further extend it and propose Sparse Concept CCA (SCoCCA), which enforces sparsity to produce more disentangled and discriminative concepts, facilitating improved activation, ablation, and semantic manipulation. Our approach generalizes concept-based explanations to multi-modal embeddings and achieves state-of-the-art performance in concept discovery, evidenced by reconstruction and manipulation tasks such as concept ablation.
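
As a rough illustration of the alignment idea mentioned in the abstract, the sketch below runs classical linear CCA on synthetic paired data. It is not the authors' implementation: the two views `X` and `Y` are hypothetical stand-ins for image and text embeddings, the constant offsets only loosely mimic a modality gap, and the function names, dimensions, noise level, and ridge term `eps` are all assumptions made for this example.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def fit_cca(X, Y, k, eps=1e-6):
    """Classical CCA via whitening and SVD (eps is a small ridge for stability)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, corrs, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U[:, :k]       # projection for the first ("image") view
    B = Wy @ Vt.T[:, :k]    # projection for the second ("text") view
    return mx, my, A, B, corrs[:k]


# Synthetic paired data: both views are noisy linear maps of a shared latent Z.
rng = np.random.default_rng(0)
n, k, dx, dy = 500, 3, 8, 6
Z = rng.normal(size=(n, k))
X = Z @ rng.normal(size=(k, dx)) + 0.05 * rng.normal(size=(n, dx)) + 3.0  # offset: toy "gap"
Y = Z @ rng.normal(size=(k, dy)) + 0.05 * rng.normal(size=(n, dy)) - 3.0

mx, my, A, B, corrs = fit_cca(X, Y, k)
U_img = (X - mx) @ A    # first view in the shared space
V_txt = (Y - my) @ B    # second view in the shared space

# Cosine similarity between each pair after projection.
pair_cos = np.sum(U_img * V_txt, axis=1) / (
    np.linalg.norm(U_img, axis=1) * np.linalg.norm(V_txt, axis=1)
)
print("canonical correlations:", np.round(corrs, 3))
print("mean paired cosine similarity:", round(float(pair_cos.mean()), 3))
```

The projections are fitted in closed form from covariance statistics, so no gradient training of the encoders is involved, which is the sense in which a CCA-based alignment can be called training-free.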

Paper Structure (46 sections, 42 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: Concept Swapping. Beyond explainability, concept decomposition enables controllable manipulation. Using SCoCCA, an embedding can be decomposed into interpretable concepts (e.g., cube and cylinder), their magnitudes swapped, and the modified embedding recomposed to synthesize an image reflecting the swapped concepts.
  • Figure 2: Method Overview. In the Concept Discovery phase, text and image embeddings are aligned via Canonical Correlation Analysis (CCA) to form a shared latent space. The Hungarian algorithm establishes a one-to-one correspondence between each concept vector and its most relevant item in the concept bank. In the Concept Decomposition phase, a new embedding is decomposed into concepts by solving a Lasso optimization using the matrix $\mathbf{C}$ and the discovered associations.
  • Figure 3: Concept Retrieval Generalization. Retrieval of the top four MSCOCO (Lin et al., 2014) images with the highest activation for the concepts Microwave and Traffic Light, where the concept bank was calibrated on ImageNet only (images are shown in descending order). While SCoCCA retrieves images with a clear presence of the concept, other methods sometimes return images with little or no evidence of the desired concept.
  • Figure 4: $\lambda$ ablation for SCoCCA on ImageNet-500. Top: Zero-shot accuracy on ImageNet-500 as a function of the $\lambda$ used when solving the sparse CoCCA coding objective in Eq. \ref{eq:lasso}. Bottom: $\|w\|_0$ normalized by $k$, as a function of $\lambda$. In both plots, $\lambda=0$ corresponds to the CoCCA baseline without the sparsity penalty. The curves show that enforcing sparsity yields substantially higher zero-shot accuracy than the non-sparse baseline.
  • Figure 5: Ablation of the $k$ hyperparameter. Zero-shot accuracy on the ImageNet-500 test set as a function of $k$, the number of concepts computed, for SCoCCA. Because more concepts allow better separation, accuracy increases monotonically with $k$. However, the rate of improvement diminishes, as additional concepts yield smaller gains in classification.
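
The matching, decomposition, and swapping steps described in the captions of Figures 1 and 2 can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: embeddings are taken to be already in the shared space, the concept bank, its labels, and the dictionary `C` are synthetic, the Lasso is written in its standard form $\tfrac{1}{2}\|z - \mathbf{C}^\top w\|_2^2 + \lambda\|w\|_1$ (the paper's exact objective is not shown in this section) and solved with a plain ISTA loop, and the helper names `match_concepts` and `lasso_ista` are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_concepts(C, bank):
    """Hungarian matching: one bank item per concept vector, maximizing cosine similarity."""
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    Bn = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    _, cols = linear_sum_assignment(-(Cn @ Bn.T))  # negate: the solver minimizes cost
    return cols  # cols[i] is the bank index assigned to concept vector i


def lasso_ista(z, C, lam, n_iter=500):
    """Minimize 0.5 * ||z - C.T @ w||^2 + lam * ||w||_1 with ISTA (proximal gradient)."""
    D = C.T                                    # columns are concept vectors
    step = 1.0 / np.linalg.norm(D, 2) ** 2     # 1 / Lipschitz constant of the gradient
    w = np.zeros(C.shape[0])
    for _ in range(n_iter):
        u = w - step * (D.T @ (D @ w - z))     # gradient step on the quadratic term
        w = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)  # soft-thresholding
    return w


# Synthetic setup; every value below is an illustrative assumption.
rng = np.random.default_rng(0)
p, k = 32, 6
names = ["cube", "cylinder", "sphere", "red", "green", "blue"]  # hypothetical bank labels
Q, _ = np.linalg.qr(rng.normal(size=(p, k)))
bank = Q.T                                        # k orthonormal bank embeddings
perm = rng.permutation(k)
C = bank[perm] + 0.02 * rng.normal(size=(k, p))   # unlabeled concept vectors: shuffled, noisy
C /= np.linalg.norm(C, axis=1, keepdims=True)

# Concept discovery: attach a bank label to every row of C.
match = match_concepts(C, bank)
concept_names = [names[j] for j in match]
i_cube, i_cyl = concept_names.index("cube"), concept_names.index("cylinder")

# Concept decomposition: sparse code of an embedding dominated by "cube".
z = 2.0 * C[i_cube] + 0.5 * C[i_cyl]
w = lasso_ista(z, C, lam=0.1)

# Concept swapping: exchange the two magnitudes and recompose,
# keeping whatever part of z the sparse code did not explain.
w_swapped = w.copy()
w_swapped[[i_cube, i_cyl]] = w[[i_cyl, i_cube]]
z_swapped = z + C.T @ (w_swapped - w)

print({concept_names[i]: round(float(w[i]), 3) for i in np.flatnonzero(w)})
```

Recomposing as `z + C.T @ (w_swapped - w)` rather than `C.T @ w_swapped` is a design choice of this sketch: it changes only the two swapped concepts and leaves the reconstruction residual untouched. Synthesizing an image from the modified embedding, as in Figure 1, would additionally require a generative decoder, which is outside the scope of this example.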