IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov

Abstract

Vision-Language Models like CLIP are extensively used for inter-modal tasks that involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper, we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.
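
To make the two operators concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes each projector is a matrix mapping pre-projection features to the shared space, uses random toy matrices in place of real CLIP weights, and reads the "intra-modal operator" as the Gram matrices $W_i^{\top} W_i$ and $W_t^{\top} W_t$ that appear in the normalization term; that reading is an assumption based on the abstract. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_shared = 12, 10, 8  # toy sizes, not CLIP's real dimensions

# Random stand-ins for the projector weights (assumption: shape is shared x pre-projection).
W_i = rng.standard_normal((d_shared, d_img))
W_t = rng.standard_normal((d_shared, d_txt))

# Toy pre-projection image and text embeddings.
x = rng.standard_normal(d_img)
y = rng.standard_normal(d_txt)

# Inter-modal operator: the only term that couples the two modalities.
psi = W_i.T @ W_t
# Assumed intra-modal operators: they enter only through the norms in the denominator.
gram_i = W_i.T @ W_i
gram_t = W_t.T @ W_t

# Cosine similarity computed the usual way, on projected features.
z_i, z_t = W_i @ x, W_t @ y
cos_direct = (z_i @ z_t) / (np.linalg.norm(z_i) * np.linalg.norm(z_t))

# The same quantity written purely in terms of the operators.
cos_operators = (x @ psi @ y) / np.sqrt((x @ gram_i @ x) * (y @ gram_t @ y))

print(np.isclose(cos_direct, cos_operators))
```

The two expressions agree because the numerator of the inter-modal cosine similarity is $x^{\top} W_i^{\top} W_t\, y$, while each norm depends on a single projector only.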

Paper Structure

This paper contains 27 sections, 31 equations, 10 figures, 10 tables, 3 algorithms.

Figures (10)

  • Figure 1: Overview of intra-modal retrieval with CLIP. (a) The standard approach simply compares the cosine similarities computed after applying projector $W_i$ to query and gallery image embeddings, which is sub-optimal due to intra-modal misalignment. (b) To circumvent misalignment, inversion approaches [mistretta2025cross] convert the query image embeddings to text embeddings by iteratively optimizing pseudo-tokens (an expensive operation that incurs high latency) and then compute inter-modal cosine similarities for retrieval. (c) We identify an inter-modal operator $\Psi=W_i^{\top} W_t$ fundamental to CLIP cosine similarity computations. We propose IsoCLIP, which uses only an isotropic region of the spectrum of $\Psi$ to align the projector weights along directions that are well-aligned between modalities. These aligned projectors are then used to map the query and gallery embeddings. IsoCLIP exploits the properties of the CLIP projectors and adds no latency to processing, while yielding better intra-modal cosine similarities and significantly improved intra-modal performance.
  • Figure 2: Spectra of the inter-modal operator $\Psi=W_i^{\top} W_t$ for CLIP ViT-B/16 and ViT-B/32 with OpenAI and DataComp (OpenCLIP) pre-training. Despite variations across models, all spectra show pronounced anisotropy in the extreme top and bottom singular directions, while staying relatively flat in the middle band.
  • Figure 3: Investigation of different regions of the spectrum (top 50, middle 50, and bottom 50 directions) of the inter-modal operator $\Psi$ for aligning the CLIP projector weights, as defined in Eq. \ref{proj:equation}, using the ViT-B/16 model for both image-to-image and text-to-text retrieval tasks. (a) Analysis of cosine similarity distributions showing very high similarities for the top band, and more well-behaved distributions for middle and bottom bands. (b) Overlap between cosine similarities of positive and negative pairs showing better separation for the middle band but highly overlapping distributions for top and bottom bands, implying higher intra-modal misalignment. (c) Performance comparison showing far superior performance using the well-aligned middle band compared to top and bottom bands.
  • Figure 4: Analysis on Dogs vs. Cats [elson2007asirra]. (left) IsoCLIP achieves higher precision than CLIP as the number of retrieved dog images $K$ increases. (center) CLIP shows significant overlap between intra-class (dog-dog) and inter-class (dog-cat) similarities due to intra-modal misalignment. (right) IsoCLIP reduces this overlap, making image-image similarities more discriminative.
  • Figure 5: Ablation showing the impact of varying the $k_t$ and $k_b$ values for the isotropic middle band selection on Caltech101 (image-to-image retrieval) and COCO (text-to-text retrieval).
  • ...and 5 more figures
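
The band selection described in the captions of Figures 1, 3, and 5 can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation. The exact form of the aligned projectors is given by an equation that is not reproduced in this section, so the sketch assumes they are obtained by projecting pre-projection features onto the singular vectors of $\Psi$ that remain after discarding the top $k_t$ and bottom $k_b$ directions. The function names `middle_band_projectors` and `cosine_matrix`, the toy sizes, and the random weights are all hypothetical.

```python
import numpy as np

def middle_band_projectors(W_i, W_t, k_t, k_b):
    """Return assumed aligned projectors built from the middle band of Psi's spectrum."""
    psi = W_i.T @ W_t  # inter-modal operator, shape (d_img, d_txt)
    # Singular values come back sorted in descending order.
    U, s, Vt = np.linalg.svd(psi, full_matrices=False)
    # Psi has rank at most the shared dimension; ignore directions beyond it.
    rank = min(W_i.shape[0], s.size)
    if k_t < 0 or k_b < 0 or k_t + k_b >= rank:
        raise ValueError("k_t and k_b must be non-negative and leave a non-empty band")
    band = slice(k_t, rank - k_b)  # drop top k_t and bottom k_b directions
    P_img = U[:, band].T   # maps pre-projection image features into the band
    P_txt = Vt[band, :]    # maps pre-projection text features into the band
    return P_img, P_txt, s[:rank]

def cosine_matrix(A, B):
    """Pairwise cosine similarities between the rows of A and the rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

# Toy stand-ins for projector weights and pre-projection image embeddings.
rng = np.random.default_rng(0)
W_i = rng.standard_normal((8, 12))
W_t = rng.standard_normal((8, 10))
P_img, P_txt, spectrum = middle_band_projectors(W_i, W_t, k_t=2, k_b=2)

queries = rng.standard_normal((3, 12))
gallery = rng.standard_normal((5, 12))
# Image-to-image retrieval scores using the band-restricted projector.
sims = cosine_matrix(queries @ P_img.T, gallery @ P_img.T)
ranking = np.argsort(-sims, axis=1)  # gallery indices, best match first
```

Because the projectors depend only on the weights, they can be computed once offline; at query time the cost is a single matrix product per embedding, which is consistent with the caption's claim that no latency is added. Varying `k_t` and `k_b` in this sketch corresponds to the ablation described for Figure 5.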