Multimodal Large Language Models as Image Classifiers

Nikita Kisel; Illia Volkov; Klara Janouskova; Jiri Matas

TL;DR

MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating the models' potential for large-scale dataset curation.

Abstract

The classification performance of Multimodal Large Language Models (MLLMs) depends critically on the evaluation protocol and on ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show that these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices (batch size, image ordering, and text encoder selection), showing that they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocols rather than a genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating the models' potential for large-scale dataset curation. This work is part of the Aiming for Perfect ImageNet-1k project, see https://klarajanouskova.github.io/ImageNet/.
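
The output-mapping step mentioned in the abstract (assigning a prediction that falls outside the provided class list to the nearest in-prompt class) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the character n-gram `embed` function is a toy stand-in for the text encoders used in the paper, and the scoring rule (a prediction counts as correct if it matches any label in the image's label set) is only one plausible reading of multilabel evaluation, not necessarily the paper's metric.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a text encoder: a bag of character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_class(prediction, class_names):
    """Keep in-prompt predictions; map out-of-prompt ones to the nearest class."""
    if prediction in class_names:
        return prediction
    query = embed(prediction)
    return max(class_names, key=lambda name: cosine(query, embed(name)))


def multilabel_accuracy(predictions, label_sets):
    """Fraction of predictions contained in the image's set of valid labels."""
    hits = sum(pred in labels for pred, labels in zip(predictions, label_sets))
    return hits / len(predictions)


# Hypothetical class list and model outputs, for illustration only.
classes = ["tabby cat", "golden retriever", "sports car"]
raw_outputs = ["tabby", "sports car"]
mapped = [map_to_class(p, classes) for p in raw_outputs]
labels = [{"tabby cat", "tiger cat"}, {"convertible"}]
print(mapped, multilabel_accuracy(mapped, labels))
```

In this toy setup the out-of-prompt output "tabby" is mapped instead of discarded, so it can still be scored against the label set. With a real text encoder, the choice of encoder would change which class counts as nearest, consistent with the abstract's remark that text encoder selection affects accuracy.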

Paper Structure (33 sections, 2 equations, 11 figures, 22 tables)

Introduction
Evaluation setup
Dataset and labels
Evaluation metric
Tasks
Model and class names overview
Results
Closed-World
Multiple-Choice
Open-World
Case study: ChatGPT vs. humans
Conclusions
Related Work
Evaluation Setup - Details
Prompt overview
...and 18 more sections

Figures (11)

Figure 1: Challenging visual recognition cases from ImageNet-1k. ImGT: original single ImageNet label, ReGT: our reannotations. Predictions of representative models, including DINOv3, EfficientNetV2, EfficientNet-L2, EVA-02 (self-)supervised on ImageNet, are often wrong on such data. Correct predictions in green.
Figure 2: The three evaluated classification tasks (OW, MC, and CW(+)), described in the Tasks section.
Figure 3: Classification accuracy on ImageNet-1k, change from the original (ImGT) to reannotated labels (ReGT).
Figure 4: To address MLLM out-of-prompt (OOP) predictions, often referred to as hallucinations [liu2024revisitingmllmsindepthanalysis, zhang2024visuallygroundedlanguagemodelsbad], we map model outputs that fall outside the provided class list to the nearest in-prompt class using the best model-specific encoder. Examples where the mapping is correct are shown. Columns correspond to the label categories from which the images were sampled. The row shows Qwen3-VL predictions, for which 38.75% of OOP predictions are correctly mapped. "OOP" denotes an out-of-prompt prediction, "Map" indicates the mapped prediction, "ReGT" refers to the reannotation of the image, and "ImGT" represents the original ImageNet label. Green and red indicate correct and incorrect image labels, respectively.
Figure 5: Images from the second annotation pass, where annotators 1. preferred the GPT-4o prediction over the reannotated labels (left), 2. retained at least part of the reannotated labels without adding the GPT-4o prediction (right), and 3. combined the reannotations with GPT-4o predictions (middle). "ImGT" stands for the ImageNet-1k ground truth, "ChatGPT" denotes the GPT-4o prediction for the image, "ReGT" refers to the first-pass reannotations, and "ReReGT" indicates the second-pass reannotations. Annotators in the second pass not only preserved correct ReGT labels or added GPT-4o predictions, but also identified and corrected erroneous labels from the first pass.
...and 6 more figures
