Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment

Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Noel E. O'Connor

TL;DR

The paper tackles the resource-intensive nature of CLIP-style vision-language models by leveraging frozen unimodal encoders and learning only lightweight projectors to align their embedding spaces. It introduces a three-part framework: (i) encoder-pair selection via Centered Kernel Alignment (CKA), (ii) concept-rich data curation to construct dense, representative concept prototypes, and (iii) a lightweight projector architecture that fuses local and global token information. Empirical results across 12 zero-shot classification datasets, 2 image-text retrieval tasks, and multilingual and long-context settings show performance competitive with or superior to CLIP models trained from scratch, with ImageNet accuracy reaching 76.3% using only ~20M training samples and ~50–60 GPU-hours of alignment compute. The approach significantly lowers data and compute requirements while preserving strong unimodal features, enabling flexible adaptation to diverse tasks and languages and broadening access to multimodal AI development.
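As a rough illustration of the encoder-pair selection step, the sketch below scores how similar two embedding spaces are with CKA. It assumes the linear-kernel variant of CKA, which may differ from the kernel the authors use, and the function name `linear_cka` is illustrative rather than taken from the paper's code. Row i of each matrix is expected to hold the embedding of the same image-caption pair from the vision and text encoder respectively.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two embedding matrices with matching rows.

    x: (n, d1) embeddings of n samples from one encoder.
    y: (n, d2) embeddings of the same n samples from another encoder.
    Assumption: linear kernel; the paper may use a different one.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must embed the same n samples")
    # Center every feature dimension across the n samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # With linear kernels, HSIC reduces to squared Frobenius norms.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

Under this definition the score lies in [0, 1] and is unchanged by rotating or uniformly rescaling either embedding space, so a candidate vision-text pair can be ranked by computing the score once on a shared set of paired samples and preferring pairs with a higher value.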

Abstract

Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications. However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: Is there a plausible way to connect unimodal backbones for vision-language tasks? To this end, we propose a novel framework that aligns vision and language using frozen unimodal encoders. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on 12 zero-shot classification datasets and 2 image-text retrieval datasets. Our best model, utilizing DINOv2 and the All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a 20-fold reduction in data and a 65-fold reduction in compute requirements compared to multimodal alignment where models are trained from scratch. The proposed framework enhances the accessibility of multimodal model development while enabling flexible adaptation across diverse scenarios. Code and curated datasets are available at github.com/mayug/freeze-align.
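The projectors are trained with a CLIP-style contrastive objective (the figure captions refer to it as the CLIP loss). A minimal sketch of such a symmetric contrastive loss on already-projected embeddings could look as follows. The temperature of 0.07 and the function names are assumptions made for illustration; the paper's actual hyperparameters and implementation are not given in this summary.

```python
import numpy as np


def l2_normalize(z: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length."""
    return z / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), eps)


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
              temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; row i of both inputs is a matched pair.

    Assumption: temperature is fixed at 0.07 here for illustration.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (n, n) cosine similarities
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should rank its own caption first.
    loss_i2t = -log_softmax(logits)[idx, idx].mean()
    # Text-to-image: each caption should rank its own image first.
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()
    return float(0.5 * (loss_i2t + loss_t2i))
```

In the frozen-encoder setting, `img_emb` and `txt_emb` would be the outputs of the trainable projectors applied to precomputed encoder features, so only the projector weights receive gradients.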

Paper Structure

This paper contains 36 sections, 2 equations, 12 figures, and 15 tables.

Figures (12)

  • Figure 1: CLIP loss minima vs. CKA for different encoder pairs on a toy dataset of image-caption pairs. We plot the CLIP loss after 500 iterations against CKA for different image and text encoders and find a negative correlation between CKA and the loss reached, i.e., encoder pairs with higher CKA are easier to align.
  • Figure 2: Overview of our concept-balanced dataset curation process. Images for each concept are acquired from curated datasets, mapped to CLIP embeddings, and averaged to construct an image prototype for each concept. Captions of the uncurated dataset are mapped to CLIP's joint embedding space, and for each concept the 2000 samples whose caption embeddings are closest to that concept's image prototype are selected.
  • Figure 3: Lightweight Projector Architecture. We train only Projection Layers to align modalities. Separate projectors are applied on both the local tokens and the CLS token for each encoder and then combined in a residual manner.
  • Figure 4: Retrieval performance vs. CKA for different encoder pairs. Text retrieval accuracies on Flickr30k are compared to CKA, calculated on the COCO val set. Projectors are trained on the COCO train set. A clear correlation exists between CKA and alignment quality, as reflected in the retrieval accuracies.
  • Figure 5: Retrieval performance comparison between DINOv2-ARL encoder pair and OpenAI CLIP as the maximum token length increases. The vertical green line indicates the standard CLIP token limit of 77.
  • ...and 7 more figures
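
The concept-balanced curation described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' pipeline: it assumes the CLIP embeddings are already computed, that every concept has at least one image, and that "closest" means highest cosine similarity. The function names and the `per_concept` parameter are hypothetical; the caption states that 2000 samples are picked per concept.

```python
import numpy as np


def l2_normalize(z: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length."""
    return z / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), eps)


def build_prototypes(image_emb: np.ndarray, concept_ids: np.ndarray,
                     num_concepts: int) -> np.ndarray:
    """Average the image embeddings of each concept into one prototype.

    Assumption: every concept id in range(num_concepts) has >= 1 image.
    """
    protos = np.zeros((num_concepts, image_emb.shape[1]))
    for c in range(num_concepts):
        protos[c] = image_emb[concept_ids == c].mean(axis=0)
    return l2_normalize(protos)


def select_captions(caption_emb: np.ndarray, prototypes: np.ndarray,
                    per_concept: int) -> np.ndarray:
    """Indices of the captions most similar to each prototype.

    Returns an array of shape (num_concepts, k), most similar first.
    """
    sims = prototypes @ l2_normalize(caption_emb).T  # (concepts, captions)
    k = min(per_concept, sims.shape[1])
    # Sorting the negated similarities puts the closest captions first.
    return np.argsort(-sims, axis=1)[:, :k]
```

A caption can be selected for more than one concept, so taking `np.unique` over the returned indices would give the deduplicated set of samples to keep.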