African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification

Gregor Geigle, Radu Timofte, Goran Glavaš

TL;DR

This paper introduces FOCI, a four-option multiple-choice benchmark for fine-grained object classification that probes LVLMs beyond standard image understanding tests. By mining hard negatives with a zero-shot CLIP model and combining five popular classification datasets with four domain-specific ImageNet-21k subsets, it creates a challenging evaluation that reveals a gap between LVLMs and their CLIP encoders. Across 12 public LVLMs, results show that LVLMs often underperform their underlying encoders, with performance heavily influenced by alignment data and explicit captioning of objects. The findings argue for stronger visio-linguistic alignment and targeted training data to improve fine-grained recognition, and position FOCI as a valuable, complementary benchmark for future LVLM development.

Abstract

Recent Large Vision-Language Models (LVLMs) demonstrate impressive abilities on numerous image understanding and reasoning tasks. The task of fine-grained object classification (e.g., distinguishing between animal species), however, has been probed insufficiently, despite its downstream importance. We fill this evaluation gap by creating FOCI (Fine-grained Object ClassIfication), a difficult multiple-choice benchmark for fine-grained object classification, from existing object classification datasets: (1) multiple-choice avoids the ambiguous answers associated with casting classification as an open-ended QA task; (2) we retain classification difficulty by mining negative labels with a CLIP model. FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k. We benchmark 12 public LVLMs on FOCI and show that it tests for a skill complementary to established image understanding and reasoning benchmarks. Crucially, CLIP models exhibit dramatically better performance than LVLMs. Since the image encoders of LVLMs come from these CLIP models, this points to inadequate alignment for fine-grained object distinction between the encoder and the LLM and warrants (pre)training data with more fine-grained annotation. We release our code at https://github.com/gregor-ge/FOCI-Benchmark.
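
The negative-mining step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the CLIP image and label embeddings have already been computed elsewhere, and the names `cosine` and `build_question`, the toy 2-D vectors, and the seeded shuffling of options are all hypothetical choices made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_question(image_emb, label_embs, labels, correct_idx,
                   num_negatives=3, seed=0):
    """Return (options, answer_position) for one multiple-choice item.

    image_emb and label_embs stand in for CLIP image/text embeddings;
    they are assumed to be precomputed (hypothetical inputs).
    """
    sims = [cosine(image_emb, emb) for emb in label_embs]
    # Rank the wrong labels by similarity to the image, most similar first.
    wrong = [i for i in range(len(labels)) if i != correct_idx]
    wrong.sort(key=lambda i: sims[i], reverse=True)
    chosen = [correct_idx] + wrong[:num_negatives]
    # Assumption: shuffle so the correct answer is not always in one slot.
    random.Random(seed).shuffle(chosen)
    options = [labels[i] for i in chosen]
    return options, chosen.index(correct_idx)


# Toy example with made-up 2-D "embeddings".
labels = ["barn swallow", "tree swallow", "cliff swallow",
          "purple martin", "school bus"]
label_embs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.7, 0.5], [0.0, 1.0]]
image_emb = [1.0, 0.05]

options, answer = build_question(image_emb, label_embs, labels, correct_idx=0)
print(options, "-> correct:", options[answer])
```

In this toy setup the dissimilar label ("school bus") is dropped, so the remaining distractors are the ones closest to the image, which is what keeps the four-option question difficult.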

Paper Structure (18 sections, 7 figures, 9 tables)

Figures (7)

  • Figure 1: The importance of object recognition: LLaVA 1.5 fails to identify the dog breed. Idefics-2 correctly recognizes it and gives a correct fact as a result.
  • Figure 2: Testing LVLMs on object classification through multiple-choice: (1) we compute the CLIP cosine similarity between a test image and the class labels and select the correct label plus the three most similar (wrong) labels; (2) we formulate a multiple-choice problem from these four labels; (3) the LVLM is given the problem and has to predict the correct choice.
  • Figure 3: We plot the LVLM accuracy against the CLIP zero-shot accuracy (using the 4 multiple-choice options for CLIP for a fair comparison) of the underlying CLIP image encoder used by the LVLM.
  • Figure 4: Accuracy of three LVLMs on ImageNet-1k, for example subsets on which the zero-shot classification with the corresponding CLIP model is (in)correct.
  • Figure 5: Results with MobileVLM v2 over its three LLM sizes with otherwise identical training.
  • ...and 2 more figures