A Vision Check-up for Language Models

Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba

TL;DR

This work investigates whether language models can acquire and convey visual knowledge despite lacking pixel inputs. It introduces a Visual Aptitude Dataset and a text-to-code-to-image pipeline to evaluate generation and recognition capabilities, augmented by a self-feedback mechanism that iteratively improves renderings. It further shows that LLM-generated images can be used to train vision backbones (MoCo-v2 on 1.3M images), yielding competitive results when textures are integrated, which demonstrates the utility of purely textual models as a source of scalable visual data. Overall, the findings suggest that LLMs capture meaningful visual structure and can generate useful synthetic data to augment natural-image datasets, with implications for data-efficient vision pretraining and vision-language integration.

Abstract

What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
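The idea of representing an image as code can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `GENERATED_CODE` string is a hard-coded stand-in for what an LLM might write, and `render`, `set_pixel`, `WIDTH`, and `HEIGHT` are hypothetical names, not the rendering setup used in the paper.

```python
# Hypothetical stand-in for an LLM's answer to a prompt such as
# "write code that draws a circle". In the study an LLM produces the code;
# here it is hard-coded so the example is self-contained.
GENERATED_CODE = """
for y in range(HEIGHT):
    for x in range(WIDTH):
        if (x - 8) ** 2 + (y - 8) ** 2 <= 25:
            set_pixel(x, y)
"""


def render(code: str, width: int = 16, height: int = 16) -> list[list[int]]:
    """Execute drawing code (Code -> Image) and return a binary image."""
    canvas = [[0] * width for _ in range(height)]

    def set_pixel(x: int, y: int) -> None:
        # Ignore out-of-bounds writes instead of raising.
        if 0 <= x < width and 0 <= y < height:
            canvas[y][x] = 1

    # The generated code sees only the drawing helper and the canvas size
    # (plus Python builtins). This is NOT a security sandbox.
    exec(code, {"set_pixel": set_pixel, "WIDTH": width, "HEIGHT": height})
    return canvas


if __name__ == "__main__":
    image = render(GENERATED_CODE)
    # Print the rendered image as ASCII art.
    for row in image:
        print("".join("#" if px else "." for px in row))
```

The point of the sketch is only that the model never touches pixels: it emits text, and a separate deterministic renderer turns that text into an image that can then be inspected or scored.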

Paper Structure (11 sections, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Vision check-up for LLMs. I. Testing the visual knowledge of language models. We propose a set of tests to check the vision abilities of language models. These include (a) writing code that renders complex visual concepts, (b) recognizing visual concepts from code, and (c) correcting rendering code with text-only self-feedback. II. We test whether LLMs can generate data to train a high-performance vision system that can be used to make semantic judgments on natural images.
  • Figure 2: Visual Aptitude Dataset. We collect a dataset of visual concepts, including shapes, objects, and scenes, and ask LLMs to generate corresponding images using a Text $\rightarrow$ Code $\rightarrow$ Image generation procedure. Guess the captions of the scenes!
  • Figure 3: Image-Text Fidelity. Median CLIP image-text retrieval percentiles of images generated by different LLMs. We include Stable Diffusion as an oracle. Chance is $50\%$.
  • Figure 4: Realism vs. Diversity. With both sampling strategies, LLMs are able to draw diverse illustrations of the same concept.
  • Figure 5: Diversity. LLMs are capable of generating diverse meaningful instances of the same concept, showcasing their ability to represent concepts beyond a single fixed prototype.
  • ...and 4 more figures
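
The retrieval-percentile metric named in the Figure 3 caption can be sketched as follows. This is one plausible reading of the metric, not the paper's exact protocol: for each image, we take the percentage of distractor captions that score lower than the true caption, then report the median over images. The function name `retrieval_percentiles` and the similarity values are made up for illustration; in practice the scores would come from a CLIP model.

```python
from statistics import median


def retrieval_percentiles(sim: list[list[float]]) -> list[float]:
    """Per-image retrieval percentile.

    sim[i][j] is the similarity between image i and caption j,
    and caption i is assumed to be the true caption of image i.
    """
    percentiles = []
    for i, row in enumerate(sim):
        distractors = [s for j, s in enumerate(row) if j != i]
        # Count distractor captions that the true caption outranks.
        beaten = sum(1 for s in distractors if s < row[i])
        percentiles.append(100.0 * beaten / len(distractors))
    return percentiles


# Toy similarity matrix (illustrative numbers, not real CLIP scores).
TOY_SIM = [
    [0.9, 0.1, 0.2, 0.3],  # true caption outranks every distractor
    [0.4, 0.5, 0.6, 0.1],  # true caption outranks two of three
    [0.2, 0.3, 0.1, 0.4],  # true caption outranks none
    [0.1, 0.2, 0.3, 0.8],  # true caption outranks every distractor
]

if __name__ == "__main__":
    per_image = retrieval_percentiles(TOY_SIM)
    print("per-image percentiles:", per_image)
    print("median percentile:", median(per_image))
```

Under this reading, a model whose images carry no information about their captions would rank the true caption at a random position, which gives an expected percentile of 50%, matching the chance level stated in the caption.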