Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings

Grégoire Dhimoïla, Thomas Fel, Victor Boutin, Agustin Picard

TL;DR

This paper addresses the geometry of vision–language embeddings by proposing the Iso‑Energy Assumption, which states that genuinely shared concepts exhibit invariant average energy across image and text modalities. It operationalizes this principle with an Aligned Sparse Autoencoder (SAE‑A) that adds a soft energy‑alignment penalty to a sparse autoencoder, preserving reconstruction while biasing the dictionary toward bimodal (cross‑modal) atoms. Empirically, SAE‑A reveals a two‑class atom structure where sparse bimodal atoms carry the cross‑modal alignment and unimodal atoms encode modality‑specific biases; removing unimodal atoms collapses the modality gap without harming retrieval, and restricting vector arithmetic to the bimodal subspace yields in‑distribution semantic edits. The work demonstrates that a principled inductive bias can both preserve model fidelity and render latent geometry interpretable and actionable, enabling targeted interventions such as gap closing and bimodal‑only semantic manipulation. Overall, Iso‑Energy offers a diagnostic and corrective framework for analyzing and controlling the geometry of multimodal embeddings, with practical implications for retrieval, editing, and robustness in vision–language foundation models.

Abstract

Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.
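
To make the energy-consistency idea concrete, the sketch below computes a per-atom average energy for image and text codes and penalizes their mismatch. It is an illustration under stated assumptions, not the authors' implementation: "energy" is taken to be the mean squared activation of an atom, the penalty is a plain squared difference, and the names `atom_energy`, `iso_energy_penalty`, `aligned_sae_loss`, and the weight `beta` are hypothetical.

```python
import numpy as np

def atom_energy(codes):
    """Average energy per atom, assumed here to be the mean squared activation.

    codes: array of shape (n_samples, n_atoms) holding sparse codes.
    """
    return np.mean(np.square(codes), axis=0)

def iso_energy_penalty(codes_img, codes_txt):
    """Squared mismatch between image-side and text-side energies, summed over atoms.

    This particular form is an assumption; the source only states that the
    penalty is soft and encourages energy consistency across modalities.
    """
    gap = atom_energy(codes_img) - atom_energy(codes_txt)
    return float(np.sum(np.square(gap)))

def aligned_sae_loss(x, x_hat, codes_img, codes_txt, beta=0.1):
    """Reconstruction error plus a weighted energy-alignment term (beta is hypothetical)."""
    recon = float(np.mean(np.sum(np.square(x - x_hat), axis=1)))
    return recon + beta * iso_energy_penalty(codes_img, codes_txt)

# Toy codes with two atoms: atom 0 fires equally in both modalities,
# atom 1 fires only on the text side.
codes_img = np.array([[1.0, 0.0], [1.0, 0.0]])
codes_txt = np.array([[1.0, 0.0], [1.0, 2.0]])
penalty = iso_energy_penalty(codes_img, codes_txt)
```

In this toy case only the second atom contributes to the penalty, since its energy differs between modalities while the first atom's energy matches.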

Paper Structure

This paper contains 83 sections, 4 theorems, 20 equations, 27 figures, 12 tables, 1 algorithm.

Key Result

Proposition 1

Consider $\bm{v} \in \mathbb{R}^d$ with decomposition $\bm{v} = \omega(\bm{x}) + \gamma(\bm{x})$ where $\omega(\bm{x}) \in \Omega$ encodes modality-specific information, $\gamma(\bm{x}) \in \Gamma$ captures cross-modal content, and $\mathbb{R}^d = \Omega \oplus \Gamma$. If visual and textual informa

Figures (27)

  • Figure 1: Multimodal data-generating process. A latent concept vector $\bm{c} \in \mathcal{C}$ (e.g., rabbit, forest, light, running) is sampled as a sparse combination of abstract concepts and rendered through domain-specific generators $\bm{g}(\cdot)$ (e.g., image or text). Dual-encoder models (e.g., CLIP, SigLIP) map these observations to a shared activation space, which sparse autoencoders (or other overcomplete dictionary learning methods) then attempt to lift back to concept-like atoms. However, without additional inductive bias, encoder-decoder pairs $(\bm{f}, \bm{\phi})$ are not uniquely determined, a well-known identifiability problem in nonlinear ICA. Here we leverage cross-modal redundancy as a useful inductive bias, nudging the solution toward recovering bimodal concepts.
  • Figure 2: (Left) Energy distribution across learned atoms. The majority of features are bimodal and medium-energy (inside the diagonals defined by a constant modality score $\mu$; see the appendix on modality metrics), while only a handful of high-energy unimodal features dominate modality-specific variance. These high-energy unimodal atoms behave like modality biases and are responsible for much of the observed modality gap. (Right) Geometric organization of concepts. Low-dimensional projections reveal three distinct clusters: image-only, text-only, and bimodal. Unimodal atoms align with the modality cones of the embedding space, while bimodal atoms occupy a modality-agnostic subspace orthogonal to these directions, thereby sustaining cross-modal alignment.
  • Figure 3: The modality gap arises from multiple unimodal concepts, while bimodal concepts are sufficient to sustain cross-modal alignment. (Left) CLIP embeddings are re-expressed through a learned dictionary. A PCA projection highlights the separation between modalities, and a UMAP layout distinguishes two types of atoms: unimodal and bimodal. (Right) Removing unimodal atoms with a binary mask $\bm{\delta} \in \{0,1\}^K$ closes the gap. The reconstructed embeddings $\widetilde{\bm{A}}$ continue to support retrieval, indicating that bimodal atoms alone capture the structure necessary for alignment.
  • Figure 4: Filtering unimodal atoms closes the modality gap without harming performance. (Left) Synthetic illustration comparing our method with the embedding shift baseline (Liang et al., 2022). Only our approach merges image and text distributions. (Right) Histogram of distances from each image (ID) and caption (OOD) embedding to its 10th nearest image neighbor. The modality gap is measured as the separation between the ID and OOD histograms. Filtering unimodal atoms aligns the two distributions, whereas shift degrades performance and leaves the gap wide open.
  • Figure 5: Semantic vector arithmetic restricted to cross-modal information. Starting from a Ruby (red stone), the target is a Sapphire (blue stone). The classical edit vector $\bm{\Delta} =$ Text $+$ Blue $-$ Red is polluted by unimodal directions, producing a query $\bm{Q} = \bm{I}_{\text{src}} + \bm{\Delta}$ that drifts out-of-distribution. In contrast, restricting to bimodal atoms yields $\bm{Q}_{\text{SAE}}$, which lies on the semantic manifold and reliably retrieves the correct target. This illustrates how unimodal features inject modality-specific bias into $\bm{\Delta}$, while Iso-Energy isolates the truly shared concepts that support valid semantic arithmetic.
  • ...and 22 more figures
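
The filtering step in Figures 3 and 4 can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: the modality score $\mu$ is taken to be the image share of an atom's energy, the thresholds `low` and `high` are hypothetical, the dictionary is a plain `(K, d)` matrix, and the distance between modality means is used only as a crude toy proxy for the gap. The helper `kth_neighbor_distance` mirrors the 10th-nearest-neighbor distance described in the Figure 4 caption.

```python
import numpy as np

def modality_score(codes_img, codes_txt, eps=1e-12):
    """Assumed modality score: image share of each atom's energy (0 = text-only, 1 = image-only)."""
    e_img = np.mean(np.square(codes_img), axis=0)
    e_txt = np.mean(np.square(codes_txt), axis=0)
    return e_img / (e_img + e_txt + eps)

def bimodal_mask(mu, low=0.1, high=0.9):
    """Binary mask delta in {0,1}^K that keeps atoms whose score lies strictly between the thresholds."""
    return ((mu > low) & (mu < high)).astype(float)

def filtered_reconstruction(codes, dictionary, delta):
    """Zero out masked atoms, then decode with the dictionary of shape (K, d)."""
    return (codes * delta) @ dictionary

def kth_neighbor_distance(queries, references, k=10, exclude_self=False):
    """Euclidean distance from each query to its k-th nearest reference.

    Set exclude_self=True when queries and references are the same set,
    so the zero self-distance is skipped.
    """
    diffs = queries[:, None, :] - references[None, :, :]
    dists = np.sort(np.linalg.norm(diffs, axis=-1), axis=1)
    return dists[:, k if exclude_self else k - 1]

# Toy setup with K = 3 atoms: atom 0 is image-only, atom 1 is text-only, atom 2 is shared.
dictionary = np.eye(3)
codes_img = np.array([[2.0, 0.0, 1.0], [2.0, 0.0, 3.0]])
codes_txt = np.array([[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]])

mu = modality_score(codes_img, codes_txt)
delta = bimodal_mask(mu)
img_filtered = filtered_reconstruction(codes_img, dictionary, delta)
txt_filtered = filtered_reconstruction(codes_txt, dictionary, delta)

# Toy proxy for the gap: distance between the image mean and the text mean.
gap_before = np.linalg.norm((codes_img @ dictionary).mean(axis=0)
                            - (codes_txt @ dictionary).mean(axis=0))
gap_after = np.linalg.norm(img_filtered.mean(axis=0) - txt_filtered.mean(axis=0))
```

In this toy example the two unimodal atoms are masked out and the image and text reconstructions coincide, so the proxy gap drops to zero. On real embeddings one would instead compare histograms of `kth_neighbor_distance` for image and caption queries against the image set.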

Theorems & Definitions (11)

  • Definition 1: Multimodal Concept Generative Process
  • Definition 2: Iso-Energy Assumption
  • Proposition 1: Modality information removal impact on ranking.
  • proof
  • Proposition 2: Ranking invariance under constant projection
  • proof
  • Corollary 1: Approximate invariance under bounded spread
  • proof
  • Proposition 3: Characterization of ranking flips under adaptive modality
  • proof
  • ...and 1 more