Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Hong-Tao Yu, Xiu-Shen Wei, Yuxin Peng, Serge Belongie

TL;DR

The paper addresses the lack of systematic evaluation of large vision-language models on fine-grained image tasks. It introduces FG-BMK, a benchmark with 1.01 million questions and 0.33 million images, spanning human-oriented dialogue-based assessment and machine-oriented retrieval/classification. Across twelve fine-grained datasets and a broad set of LVLMs, it finds that contrastive training enhances fine-grained discriminability, while alignment with granular textual content can hinder performance; robustness to perturbations remains a weakness and gains from scaling are limited. The work reveals that LVLMs lag behind specialized fine-grained models, offering guidance for targeted data curation and model design to push toward more capable fine-grained vision-language systems.

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.

Paper Structure

This paper contains 35 sections, 22 figures, 12 tables.

Figures (22)

  • Figure 1: Our proposed benchmark: The human-oriented evaluation tests the model’s ability to handle fine-grained visual queries (true/false, multiple-choice, short-answer), while the machine-oriented evaluation directly assesses visual feature representation through image retrieval and classification tasks.
  • Figure 2: Results of InternVL3 [zhu2025internvl3] on true/false and multiple-choice questions across different levels of granularity on the CUB-200-2011 [CUB] dataset. The $x$-axis denotes the granularity of the recognition questions.
  • Figure 3: Comparison of the original and fine-tuned LLaVA models on occurrence-balanced fine-grained bird categories. True/false question accuracy for each category is ranked, with blue dots representing the original model and yellow dots the fine-tuned model.
  • Figure 4: Retrieval results of LVLM visual features on twelve fine-grained datasets. Different colors represent different models.
  • Figure 5: Classification results of LVLM visual features on twelve fine-grained datasets. Different colors represent different models.
  • ...and 17 more figures