A Vision Check-up for Language Models
Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba
TL;DR
This work investigates whether language models can acquire and convey visual knowledge even though they never receive pixel inputs. It introduces a Visual Aptitude Dataset and a text-to-code-to-image pipeline for evaluating generation and recognition capabilities, along with a self-feedback mechanism that iteratively improves the renderings. It further shows that images produced by LLMs can be used to train vision backbones (MoCo-v2 on 1.3M images), yielding competitive results when textures are integrated. Overall, the findings suggest that LLMs capture meaningful visual structure and can generate useful synthetic data to augment natural-image datasets, with implications for data-efficient vision pretraining and vision-language integration.
Abstract
What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
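To make the idea of "code as an image representation" concrete, the following is a minimal illustrative sketch, not the authors' pipeline. It assumes a scene is described by a short program that fills shapes on a pixel grid; the helper names (`blank_canvas`, `fill_rect`, `fill_circle`, `draw_scene`, `to_ppm`), the canvas size, and the scene itself are all hypothetical and chosen only for illustration.

```python
# Hypothetical illustration: an "image" expressed as a short program.
# Nothing here is taken from the paper; names and values are assumptions.
WIDTH, HEIGHT = 64, 64
WHITE, YELLOW, BLUE = (255, 255, 255), (255, 220, 0), (40, 80, 200)


def blank_canvas(width, height, color):
    """Return a height x width grid of RGB tuples filled with one color."""
    return [[color for _ in range(width)] for _ in range(height)]


def fill_rect(canvas, x0, y0, x1, y1, color):
    """Fill pixels with x0 <= x < x1 and y0 <= y < y1, clipped to the canvas."""
    for y in range(max(y0, 0), min(y1, len(canvas))):
        for x in range(max(x0, 0), min(x1, len(canvas[0]))):
            canvas[y][x] = color


def fill_circle(canvas, cx, cy, radius, color):
    """Fill every pixel whose center lies within `radius` of (cx, cy)."""
    for y, row in enumerate(canvas):
        for x in range(len(row)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                row[x] = color


def draw_scene():
    """The 'image as code': a sun above a house body, on a white background."""
    canvas = blank_canvas(WIDTH, HEIGHT, WHITE)
    fill_circle(canvas, 48, 16, 8, YELLOW)   # sun
    fill_rect(canvas, 12, 36, 36, 60, BLUE)  # house body
    return canvas


def to_ppm(canvas):
    """Serialize the grid as a plain-text PPM (P3) image."""
    header = f"P3\n{len(canvas[0])} {len(canvas)}\n255\n"
    body = "\n".join(
        " ".join(f"{r} {g} {b}" for r, g, b in row) for row in canvas
    )
    return header + body + "\n"


image = draw_scene()
```

Writing `to_ppm(image)` to a `.ppm` file gives a raster that most image viewers can open. The point of the sketch is only that a text-only model can specify shapes, positions, and colors through a program, and that executing the program yields pixels.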
