Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Hong-Tao Yu; Xiu-Shen Wei; Yuxin Peng; Serge Belongie

TL;DR

The paper addresses the lack of systematic evaluation of large vision-language models on fine-grained image tasks. It introduces FG-BMK, a benchmark with 1.01 million questions and 0.33 million images, spanning human-oriented dialogue-based assessment and machine-oriented retrieval/classification. Across twelve fine-grained datasets and a broad set of LVLMs, it finds that contrastive training enhances fine-grained discriminability, while alignment with granular textual content can hinder performance; robustness to perturbations remains a weakness and gains from scaling are limited. The work reveals that LVLMs lag behind specialized fine-grained models, offering guidance for targeted data curation and model design to push toward more capable fine-grained vision-language systems.

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.
