Why are Visually-Grounded Language Models Bad at Image Classification?

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

TL;DR

This paper systematically evaluates visually-grounded language models (VLMs) on image classification and finds that they substantially underperform CLIP-based baselines. Analyzing inference, training, and data, the authors identify data availability as the primary cause of the gap: with sufficient classification data, VLMs can reach state-of-the-art accuracy. Training a VLM on classification-focused data improves object recognition and also transfers to broader capabilities, yielding an 11.8% gain on the newly introduced ImageWikiQA dataset. Combining classification data with instruction-tuning data preserves and expands VLM capabilities, which supports a data-centric path to stronger visual reasoning systems.

Abstract

Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand the reason, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between the frequency of class exposure during VLM training and instruction-tuning and the VLM's performance in those classes; when trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models. Based on these findings, we enhance a VLM by integrating classification-focused datasets into its training, and demonstrate that the enhanced classification performance of the VLM transfers to its general capabilities, resulting in an improvement of 11.8% on the newly collected ImageWikiQA dataset.
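The relationship between class exposure and per-class accuracy described above can be quantified with a rank correlation. The following is a minimal sketch, assuming Spearman's rho as the statistic (the abstract does not say which measure the authors use); the function names and the toy numbers are hypothetical and are not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data (NOT from the paper): how often each class appears
# in the training data, and the model's accuracy on that class.
class_frequency = [3, 12, 50, 200, 900]
class_accuracy = [0.10, 0.35, 0.30, 0.70, 0.92]

rho = spearman(class_frequency, class_accuracy)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based statistic is used here because class frequencies tend to span several orders of magnitude, so a linear correlation on raw counts would be dominated by the most frequent classes.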

Paper Structure (74 sections, 9 figures, 14 tables)


Figures (9)

  • Figure 1: Overview. (Left) Different visually-grounded language models (VLMs) underperform CLIP in classification by a large margin, though they often use CLIP as a vision encoder. (Middle) We investigate several hypotheses about why VLMs are bad classifiers and find that the main reason is data. Critical information for image classification is encoded in the VLM's latent space but can only be decoded with enough data during VLM training. (Right) Based on our analysis, we improve a VLM by integrating classification data into its training, and find that the improved classification capabilities serve as foundations for more advanced capabilities such as visual question answering.
  • Figure 2: Analysis of the label set size. For each image, we randomly sample 100, 20, 5, 2 candidate classes from all the classes. The performance gap between VLMs and CLIPs becomes smaller when the number of classes is reduced. X-axis: number of classes; Y-axis: accuracy (%).
  • Figure 3: Analysis of VLMs from the data perspective. We study the relation between the ImageNet class frequency in the VLM training data and the VLM's classification performance on those classes. A strong correlation is observed, indicating that data determines VLM classification performance.
  • Figure 4: Analysis of the label set size. For each image, we randomly sample 100, 20, 5, and 2 candidate classes from the entire set of classes. While the absolute performance gap between VLMs and CLIPs decreases as the number of classes is reduced, the relative performance gap increases. The X-axis represents the number of classes, and the Y-axis represents the relative error rate between LLaVA1.5-7B and CLIP-L.
  • Figure 5: Fine-tuning only the projector improves numerical stability. (Top) Fine-tuning LLMs with LoRA often results in numerical instabilities, manifesting as spikes in loss (purple, green, brown, orange curves). In contrast, fine-tuning only the projector leads to a consistently steady decrease in loss (teal curve). Despite experimenting with various hyperparameters for ImageNet, the instability remained. (Bottom) Occasionally, the spikes normalize with continued training. Here, we present an example using the StanfordCars dataset (pink curve).
  • ...and 4 more figures
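
The label-set-size analysis in the Figure 2 and Figure 4 captions samples a reduced set of candidate classes for each image. A minimal sketch of that sampling step follows, assuming the ground-truth class is always kept among the candidates (the captions do not state this explicitly); `sample_candidates` and the `class_i` label names are hypothetical.

```python
import random


def sample_candidates(true_label, all_labels, k, rng):
    """Return k shuffled candidate labels that always include true_label.

    Assumes true_label is one of all_labels and that labels are unique.
    """
    if k < 1 or k > len(all_labels):
        raise ValueError("k must be between 1 and the number of labels")
    distractors = [label for label in all_labels if label != true_label]
    # Draw k - 1 distinct distractors, then add the ground-truth label.
    candidates = rng.sample(distractors, k - 1) + [true_label]
    # Shuffle so the position of the correct answer carries no signal.
    rng.shuffle(candidates)
    return candidates


rng = random.Random(0)  # fixed seed so the sampled sets are reproducible
labels = [f"class_{i}" for i in range(1000)]  # hypothetical 1000-class label set

for k in (100, 20, 5, 2):
    candidates = sample_candidates("class_7", labels, k, rng)
    # With k equally likely candidates, random guessing scores 1/k.
    print(k, len(candidates), f"chance accuracy = {1 / k:.2%}")
```

Because chance accuracy rises as the number of candidates falls, absolute accuracy gaps naturally shrink for small label sets, which is one reason to also look at a relative measure as in Figure 4.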