
Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models

Dhruba Ghosh, Yuhui Zhang, Ludwig Schmidt

TL;DR

This work finds that a better LLM improves all benchmark scores equally, whereas a better vision encoder disproportionately improves fine-grained classification performance. The pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining.

Abstract

Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident in a wide range of VLMs built on a variety of base models, alignment architectures, and training data. However, recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.

Paper Structure (30 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Overview. (Top) We investigate the fine-grained classification capabilities of vision-language models (VLMs), a crucial yet often overlooked aspect that underpins higher-level understanding and reasoning. (Bottom) Through 22 systematic ablation experiments on key model components and training strategies (left), we identify the factors that drive fine-grained classification performance in VLMs (right).
  • Figure 2: Fine-grained classification compared to general VQA performance across VLMs. Analysis of recent VLMs indicates that fine-grained classification represents a distinct aspect of visual capability that standard VQA benchmarks fail to measure.
  • Figure 3: Comparison of VLMs with their corresponding CLIP vision encoders in fine-grained classification. While Qwen2-VL-Chat nearly matches the performance of its vision encoder DFN-CLIP, all other VLMs fall significantly behind. This highlights that VLMs have considerable room for improvement in fine-grained classification tasks.
  • Figure 4: Ablating base LLM, swapping out Vicuna-7B from LLaVA with other LLMs. On average, switching from Vicuna to Qwen2-7B results in a +7.5pp increase in fine-grained performance and a +7.7pp improvement in general VQA performance.
  • Figure 5: Ablating base vision encoder, swapping out CLIP ViT-L/14 from LLaVA for DFN-CLIP ViT-H/14. We find that switching to DFN-CLIP improves fine-grained performance by +4.5pp and general performance by +1.2pp, given enough pretraining.
  • ...and 3 more figures
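
The captions of Figures 4 and 5 quote percentage-point (pp) gains that make the "equal" versus "disproportionate" contrast concrete. The short sketch below only restates that arithmetic: the four numbers come from the captions, while the dictionary layout and the `gain_ratio` helper are illustrative assumptions, not a metric defined by the paper.

```python
# Percentage-point (pp) gains as quoted in the captions of Figures 4 and 5.
# The keys and structure are hypothetical; only the four numbers are from the source.
GAINS_PP = {
    "LLM swap (Vicuna-7B -> Qwen2-7B)": {
        "fine_grained": 7.5,
        "general_vqa": 7.7,
    },
    "Vision encoder swap (CLIP ViT-L/14 -> DFN-CLIP ViT-H/14)": {
        "fine_grained": 4.5,
        "general_vqa": 1.2,
    },
}


def gain_ratio(gains: dict) -> float:
    """Illustrative statistic: fine-grained gain divided by general VQA gain.

    A value near 1 means both benchmark families improved about equally;
    a value well above 1 means fine-grained classification benefited more.
    """
    return gains["fine_grained"] / gains["general_vqa"]


if __name__ == "__main__":
    for name, gains in GAINS_PP.items():
        print(f"{name}: fine-grained / general gain ratio = {gain_ratio(gains):.2f}")
```

Under these quoted numbers, the LLM swap gives a ratio just below 1, while the vision encoder swap gives a ratio well above 1, which matches the qualitative claim in the abstract. The Figure 5 caption qualifies its gains with "given enough pretraining", so the second ratio should be read with that condition in mind.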