Linear Spaces of Meanings: Compositional Structures in Vision-Language Models

Matthew Trager, Pramuditha Perera, Luca Zancato, Alessandro Achille, Parminder Bhatia, Stefano Soatto

TL;DR

This work shows that contextual embeddings in vision-language models often admit a low-dimensional, linear factorization into "ideal words," one for each component of a composite concept. By defining decomposable embeddings and a centered decomposition, the authors connect this geometric structure to probabilistic independence in models like CLIP, and show that simple linear operations yield interpretable, controllable manipulations. Empirically, ideal-word decompositions prove useful for compositional classification, debiasing, and retrieval, and images generated by a diffusion model from sums of ideal words illustrate the approach qualitatively. The findings offer a framework for interpreting and regulating VLM behavior, with potential extensions to kernelized, cross-modal, and generative settings.

Abstract

We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within the embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP's embeddings and we evaluate their usefulness for solving different vision-language tasks such as classification, debiasing, and retrieval. Our results show that simple linear algebraic operations on embedding vectors can be used as compositional and interpretable methods for regulating the behavior of VLMs.
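
The idea of approximating encoder outputs by sums of a few "ideal words" can be sketched numerically. The block below is a minimal illustration, not the authors' implementation: it assumes two factors (attribute and object), uniform weights, and synthetic vectors standing in for CLIP embeddings. The function names `ideal_words` and `compose` are hypothetical.

```python
import numpy as np

def ideal_words(emb):
    """Estimate ideal words from a grid of embeddings.

    emb has shape (n_attr, n_obj, d): one embedding per (attribute, object) pair.
    Assumption: uniform weights, so u0 is the grand mean and each ideal word
    is a per-factor mean minus u0.
    """
    u0 = emb.mean(axis=(0, 1))        # shape (d,)
    u_attr = emb.mean(axis=1) - u0    # shape (n_attr, d)
    u_obj = emb.mean(axis=0) - u0     # shape (n_obj, d)
    return u0, u_attr, u_obj

def compose(u0, u_attr, u_obj):
    """Rebuild every pair as u0 + u_attr[a] + u_obj[o]; shape (n_attr, n_obj, d)."""
    return u0 + u_attr[:, None, :] + u_obj[None, :, :]

# Synthetic stand-in for encoder outputs (hypothetical data, not CLIP):
# an additive structure plus a small non-additive perturbation.
rng = np.random.default_rng(0)
n_attr, n_obj, d = 5, 4, 16
base = rng.normal(size=d)
a_vecs = rng.normal(size=(n_attr, d))
o_vecs = rng.normal(size=(n_obj, d))
noise = 0.05 * rng.normal(size=(n_attr, n_obj, d))
emb = base + a_vecs[:, None, :] + o_vecs[None, :, :] + noise

u0, u_attr, u_obj = ideal_words(emb)
approx = compose(u0, u_attr, u_obj)
rel_err = np.linalg.norm(emb - approx) / np.linalg.norm(emb)
print(f"relative reconstruction error: {rel_err:.4f}")
```

With real data, `emb` would hold the encoder outputs for every string in the grid. The per-factor averaging is the least-squares additive fit under uniform weights, so a small `rel_err` suggests the embeddings are close to decomposable.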

Paper Structure (21 sections, 19 theorems, 51 equations, 9 figures, 7 tables)

This paper contains 21 sections, 19 theorems, 51 equations, 9 figures, 7 tables.

Key Result

Lemma 1

1) A collection of vectors $r(\mathcal{Z})$ is decomposable if and only if the vector difference ${\bm{u}}_{z} - {\bm{u}}_{z'}$ does not depend on the components that $z, z' \in \mathcal{Z}$ share. 2) If $|\mathcal{Z}_i| = n_i$, then the dimension of $\mathrm{Span}(r(\mathcal{Z}))$ is at most $1 + \ldots$
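
One direction of the first statement can be checked in a line. The sketch below assumes (Definition 1 is not reproduced here) that $\mathcal{Z} = \mathcal{Z}_1 \times \dots \times \mathcal{Z}_k$ and that "decomposable" means each vector is a sum of one vector per factor. The notation ${\bm{u}}_{z_i}$ for the per-factor vectors is an assumption made for illustration.

```latex
% Assumption: for z = (z_1, \dots, z_k), decomposability means
%   {\bm{u}}_{z} = {\bm{u}}_{z_1} + \dots + {\bm{u}}_{z_k}.
\begin{align*}
{\bm{u}}_{z} - {\bm{u}}_{z'}
  &= \sum_{i=1}^{k} \left( {\bm{u}}_{z_i} - {\bm{u}}_{z'_i} \right) \\
  &= \sum_{i \,:\, z_i \neq z'_i} \left( {\bm{u}}_{z_i} - {\bm{u}}_{z'_i} \right),
\end{align*}
% since every term with z_i = z'_i cancels. The shared components
% therefore do not appear in the difference.
```

For instance, with factors color and object, the difference between the vectors for (green, house) and (red, house) would reduce to ${\bm{u}}_{\rm green} - {\bm{u}}_{\rm red}$, the same as for (green, bike) and (red, bike).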

Figures (9)

  • Figure 1: Compositional structures in contextual embeddings. We show that the embeddings of composite concepts are often approximately decomposable as a sum of vectors corresponding to each factor. These vectors are not embeddings of actual words, but they can be viewed as "ideal words" and used for interpretable manipulations of the representations.
  • Figure 2: Visualization of embeddings. Top: projected embeddings of manually constructed strings associated with decomposable concepts. Bottom: projected embeddings for strings of the type "an image of a [a] [o]" for randomly chosen attributes and objects from MIT-states (Isola et al., 2015) and UTZappos (Yu and Grauman, 2014). Symmetric structures indicate that embeddings are approximately decomposable. See text for details.
  • Figure 3: Visualization of ideal words. First row: images generated by Stable Diffusion with the prompt "a photo of a green house." Because of the contextual encoder, "house" influences the meaning of "green." Following rows: we compute ideal-word approximations for strings of the form "a photo of a [color] $\times$ [object]," using five colors and four objects. In the second row, we generate images using the vector ${\bm{u}}_0 + {\bm{u}}_{\rm green} + {\bm{u}}_{\rm house}$. Now ${\bm{u}}_{\rm green}$ means green-colored because of how the string "green" composes with most objects. In the third row, we generate images using ${\bm{u}}_0 + {\bm{u}}_{\rm [color]} + {\bm{u}}_{\rm house}$ for different colors; in the fourth row, we use ${\bm{u}}_0 + {\bm{u}}_{\rm [color]} + {\bm{u}}_{\rm bike}$. The images were not cherry-picked or manipulated in any way. This example shows that we can generate embeddings of composite concepts by simply adding vectors in the representation space.
  • Figure 4: Projected embeddings of manually constructed strings associated with factored concepts, as described in the experiments section in the main body of the paper. Top: trained encoder (same as in Figure 2). Bottom: visualization of the embeddings for the same strings using a randomly initialized encoder. Even without semantic information, the embeddings in the first three examples are still roughly decomposable.
  • Figure 5: Comparison between projected embeddings using a trained encoder (left figure in each pair) and using a randomly initialized encoder (right figure in each pair). Both encoders lead to symmetric structures when the strings have a factored syntax (bottom row), while only the trained encoder shows these approximate structures when the factorization is semantic (top row).
  • ...and 4 more figures
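
The "symmetric structures" mentioned in the captions above can also be checked numerically rather than by eye. The sketch below is one possible diagnostic and not a metric taken from the paper: it measures how much the difference between two attributes changes as the object varies. The function name `parallelogram_gap` and the synthetic inputs are hypothetical.

```python
import numpy as np

def parallelogram_gap(emb):
    """Relative deviation of attribute differences from their mean over objects.

    emb has shape (n_attr, n_obj, d). For every attribute pair (a, a2) and
    object o we form emb[a, o] - emb[a2, o]. If the embeddings are exactly
    decomposable, this difference is the same for every o and the gap is 0.
    """
    diff = emb[:, None, :, :] - emb[None, :, :, :]   # (n_attr, n_attr, n_obj, d)
    mean_diff = diff.mean(axis=2, keepdims=True)      # average over objects
    num = np.linalg.norm(diff - mean_diff)
    den = np.linalg.norm(diff)
    return num / den if den > 0 else 0.0

# Hypothetical inputs: an exactly additive grid versus unstructured vectors.
rng = np.random.default_rng(1)
additive = rng.normal(size=(3, 1, 8)) + rng.normal(size=(1, 4, 8))
unstructured = rng.normal(size=(3, 4, 8))

print("gap (additive):    ", parallelogram_gap(additive))
print("gap (unstructured):", parallelogram_gap(unstructured))
```

Applied to embeddings from a trained and a randomly initialized encoder on the same strings, a score like this could give a rough numeric counterpart to the visual comparison in the figures.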

Theorems & Definitions (33)

  • Definition 1: Decomposable embeddings
  • Lemma 1
  • Lemma 1: Centered decomposition
  • Proposition 1
  • Example 2
  • Proposition 2
  • Proposition 2
  • Corollary 2
  • Proposition 2: Relaxed feasibility of linear factorizations
  • Example 3
  • ...and 23 more