Better Language Models Exhibit Higher Visual Alignment
Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano
TL;DR
The paper asks whether text-only LLMs encode visually grounded knowledge. It plugs frozen LLM representations into a discriminative, CLIP-like framework with frozen backbones and measures zero-shot generalization to novel concepts. Comparing encoder- and decoder-based LLMs, it finds that decoder-based models show stronger visual alignment and that language-modeling capability correlates with visual generalization. Building on this, the authors introduce ShareLock, a lightweight method that fuses frozen vision and language backbones through an enhanced projection head. ShareLock reaches competitive vision-language performance with far less paired data than traditional VLMs and transfers strongly across languages. The results also inform architectural choices for future multimodal models.
Abstract
How well do text-only large language models (LLMs) align with the visual world? We present a systematic evaluation of this question by incorporating frozen representations of various language models into a discriminative vision-language framework and measuring zero-shot generalization to novel concepts. We find that decoder-based models exhibit stronger visual alignment than encoders, even when controlling for model and dataset size. Moreover, language modeling performance correlates with visual generalization, suggesting that advances in unimodal LLMs can simultaneously improve vision models. Leveraging these insights, we propose ShareLock, a lightweight method for fusing frozen vision and language backbones. ShareLock achieves robust performance across tasks while drastically reducing the need for paired data and compute. With just 563k image-caption pairs and under one GPU-hour of training, it reaches 51% accuracy on ImageNet. In cross-lingual settings, ShareLock dramatically outperforms CLIP, achieving 38.7% top-1 accuracy on Chinese image classification versus CLIP's 1.4%. Code is available.
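To make the setup described above concrete, the following is a minimal NumPy sketch of a CLIP-like objective over precomputed features from frozen backbones, together with zero-shot classification by cosine similarity. It is an illustration under stated assumptions, not the authors' implementation: the function names, the two-layer MLP head, the choice to apply the head to the language features, and the temperature of 0.07 are all hypothetical and do not come from the abstract.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def project_text(text_feats, W1, b1, W2, b2):
    """Hypothetical two-layer MLP head mapping frozen LLM features
    into the (frozen) image feature space. Only these weights would be trained."""
    hidden = np.maximum(text_feats @ W1 + b1, 0.0)  # ReLU
    return hidden @ W2 + b2


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss: row i of image_emb is assumed
    to be paired with row i of text_emb. Temperature is an assumed value."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature

    def cross_entropy_on_diagonal(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def zero_shot_predict(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)
```

In this sketch, both backbones would be run once to cache features, after which only the head parameters (`W1`, `b1`, `W2`, `b2`) are optimized against `clip_style_loss`. Caching frozen features is one plausible reason training can be cheap, though the abstract itself does not spell out the mechanism.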
