Better Language Models Exhibit Higher Visual Alignment

Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano

TL;DR

The paper interrogates whether text-only LLMs encode visually grounded knowledge by evaluating zero-shot generalization to novel concepts via a discriminative CLIP-like framework with frozen backbones. It systematically compares encoder- and decoder-based LLMs, finding that decoder-based models exhibit stronger visual alignment, and shows that language-model capability correlates with visual generalization. Building on this insight, the authors introduce ShareLock, a lightweight method that fuses frozen vision and language backbones with an enhanced projection head to achieve competitive vision-language performance using far less paired data than traditional VLMs, while delivering strong cross-lingual transfer. The work demonstrates the practical value of leveraging pretrained LLMs for efficient, multilingual vision-language systems and provides guidance on architectural choices for future multimodal models. Overall, decoder-based LLMs emerge as a rich source of visually relevant representations that can be effectively harnessed to build data-efficient, multilingual VLMs.

Abstract

How well do text-only large language models (LLMs) align with the visual world? We present a systematic evaluation of this question by incorporating frozen representations of various language models into a discriminative vision-language framework and measuring zero-shot generalization to novel concepts. We find that decoder-based models exhibit stronger visual alignment than encoders, even when controlling for model and dataset size. Moreover, language modeling performance correlates with visual generalization, suggesting that advances in unimodal LLMs can simultaneously improve vision models. Leveraging these insights, we propose ShareLock, a lightweight method for fusing frozen vision and language backbones. ShareLock achieves robust performance across tasks while drastically reducing the need for paired data and compute. With just 563k image-caption pairs and under one GPU-hour of training, it reaches 51% accuracy on ImageNet. In cross-lingual settings, ShareLock dramatically outperforms CLIP, achieving 38.7% top-1 accuracy on Chinese image classification versus CLIP's 1.4%. Code is available.
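The discriminative vision-language framework described above is CLIP-like, so its training signal can be illustrated with a symmetric contrastive loss over paired embeddings. The following is a minimal sketch, not the authors' implementation: it assumes the image and text vectors have already been produced by the frozen backbones and passed through a projection head, and the function names and the temperature value of 0.07 are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th image should match the i-th text.

    `temperature` is an assumed value, not one taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each row should peak on its diagonal entry.
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (image_to_text + text_to_image)


# Toy example with hypothetical 2-d embeddings.
aligned_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the correctly paired batch yields a much lower loss than the swapped batch. In a setting with frozen backbones, only the parameters of the projection head would receive gradients from such a loss.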

Paper Structure

This paper contains 46 sections, 2 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Visual generalization vs language comprehension. Language modeling capability on MMLU-Pro predicts LLM visual transfer performance (Pearson-$r$: $0.768$). We compute a visual generalization score by aligning language with vision features in a CLIP-like framework and evaluating on disjoint sets of unaligned classes across four datasets. Dot size is proportional to the LLM's parameter count.
  • Figure 2: Comparison of training modes. (Left) Web-scale VLM training lacks concept control, weakening generalization claims and resulting in erratic drops for rare categories. (Center) Our visual alignment probing protocol enforces strict concept separation to assess true generalization. (Right) Our ShareLock method uses lightweight projections to align frozen unimodal models via CLIP-style contrastive learning and zero-shot evaluation.
  • Figure 3: Encoder- vs decoder-based language models. The models are trained on identical data with matched model size, isolating the effects of their pretraining objectives. Decoders demonstrate higher visual alignment across all model sizes compared to encoder models.
  • Figure 4: Our ShareLock vs. previous methods. Compared to CLIP and LiT, ShareLock uses frozen pretrained representations for both modalities, allowing extremely efficient training. Using this framework, we assess how "visual" frozen language models' text representations are by how strongly the resulting model generalizes to entirely novel categories. We find that decoder-only LLMs yield strong zero-shot generalization and hence incorporate them as the text backbones.
  • Figure 5: Scaling of image-text dataset size. ShareLock outperforms other models despite using notably fewer datapoints.
  • ...and 4 more figures
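
The captions for Figures 1 and 2 describe evaluating on classes that are disjoint from those seen during alignment. A minimal sketch of that kind of protocol follows, assuming class-name text embeddings and image embeddings already live in a shared space. The helper names, the sorted-order split, and the toy 2-d vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_concepts(class_names, num_heldout):
    """Deterministically split classes into seen and held-out sets (num_heldout > 0).

    Sorting by name is an arbitrary illustrative choice, not the paper's split.
    """
    ordered = sorted(class_names)
    return ordered[:-num_heldout], ordered[-num_heldout:]


def zero_shot_accuracy(image_embs, labels, class_text_embs):
    """Assign each image to the class whose text embedding is most similar."""
    correct = 0
    for emb, label in zip(image_embs, labels):
        predicted = max(class_text_embs, key=lambda name: cosine(emb, class_text_embs[name]))
        correct += predicted == label
    return correct / len(labels)


# Hypothetical class list; alignment would use only the `seen` classes.
seen, heldout = split_concepts(["cat", "dog", "zebra", "canoe"], num_heldout=2)

# Toy embeddings for the held-out classes only (illustrative values).
heldout_text_embs = {"dog": [1.0, 0.1], "zebra": [0.1, 1.0]}
heldout_images = [[0.9, 0.2], [0.2, 0.8], [0.7, 0.1]]
heldout_labels = ["dog", "zebra", "dog"]

accuracy = zero_shot_accuracy(heldout_images, heldout_labels, heldout_text_embs)
```

Keeping the two class sets disjoint is what lets the resulting accuracy be read as generalization to novel concepts rather than recall of concepts seen during alignment.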