When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification

Zirui Pang, Haosheng Tan, Yuhan Pu, Zhijie Deng, Zhouan Shen, Keyu Hu, Jiaheng Wei

TL;DR

This work tackles the pervasive issue of noisy and missing labels in widely used image classification benchmarks by introducing REVEAL, a dataset renovation framework that combines vision-language models with human-guided label curation. REVEAL constructs pseudo ground-truths via model voting, estimates model expertise, and performs weighted aggregation to produce soft, likelihood-based label assignments for missing or noisy annotations. The approach is evaluated across six benchmarks (e.g., CIFAR-10/100, ImageNet, Caltech256, QuickDraw, MNIST) and shows high alignment with human judgments in most cases, revealing that missing labels and label noise are systemic in public datasets. The framework demonstrates that combining VLMs with principled noise-curation methods yields more reliable test sets and enables more meaningful, open-vocabulary evaluations, while also offering insights into model-hierarchy biases and the cognitive alignment between AI and humans. Overall, REVEAL advances benchmark quality and provides a practical path toward richer, probabilistic annotations for multi-label image understanding.

Abstract

Image classification benchmark datasets such as CIFAR, MNIST, and ImageNet serve as critical tools for model evaluation. However, despite prior cleaning efforts, these datasets still suffer from pervasive noisy labels and often contain missing labels, because multiple classes can co-exist in a single image sample. This leads to misleading model comparisons and unfair evaluations. Existing label cleaning methods focus primarily on noisy labels, while the issue of missing labels remains largely overlooked. Motivated by these challenges, we present REVEAL, a comprehensive framework that integrates state-of-the-art pre-trained vision-language models (e.g., LLaVA, BLIP, Janus, Qwen) with advanced machine and human label curation methods (e.g., Docta, Cleanlab, MTurk) to systematically address both noisy labels and missing labels in widely used image classification test sets. REVEAL detects potential noisy labels and omissions, aggregates predictions from the various methods, and refines label accuracy through confidence-informed predictions and consensus-based filtering. Additionally, we provide a thorough analysis of state-of-the-art vision-language models and pre-trained image classifiers, summarizing their strengths and limitations for dataset renovation in 10 observations. Our method effectively reveals missing labels in public datasets and provides soft-labeled results with likelihoods. As confirmed by human verification, REVEAL significantly improves the quality of 6 benchmark test sets, aligns closely with human judgments, and enables more accurate and meaningful comparisons in image classification.
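
The aggregation steps summarized above (pseudo ground-truths via model voting, expertise estimation, weighted aggregation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `min_votes` rule, the use of Jaccard agreement as the expertise measure, and the toy predictions are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def pseudo_ground_truth(votes, min_votes):
    """Build a pseudo ground-truth label set per image by model voting.

    votes: {model_name: [set of predicted labels for each image]}
    A label is kept for an image if at least `min_votes` models predict it
    (an assumed voting rule).
    """
    n_images = len(next(iter(votes.values())))
    pseudo = []
    for i in range(n_images):
        counts = Counter(label for preds in votes.values() for label in preds[i])
        pseudo.append({label for label, c in counts.items() if c >= min_votes})
    return pseudo


def estimate_expertise(votes, pseudo):
    """Score each model by its mean Jaccard agreement with the pseudo ground truth.

    Jaccard agreement is an assumed proxy for "expertise".
    """
    expertise = {}
    for model, preds in votes.items():
        scores = []
        for predicted, reference in zip(preds, pseudo):
            union = predicted | reference
            scores.append(len(predicted & reference) / len(union) if union else 1.0)
        expertise[model] = sum(scores) / len(scores)
    return expertise


def weighted_scores(votes, expertise):
    """Sum expertise weights of the models that voted for each label, per image."""
    n_images = len(next(iter(votes.values())))
    aggregated = []
    for i in range(n_images):
        scores = defaultdict(float)
        for model, preds in votes.items():
            for label in preds[i]:
                scores[label] += expertise[model]
        aggregated.append(dict(scores))
    return aggregated


# Toy predictions from three hypothetical models on two images.
votes = {
    "model_a": [{"cat"}, {"dog", "ball"}],
    "model_b": [{"cat", "sofa"}, {"dog"}],
    "model_c": [{"cat"}, {"dog", "ball"}],
}
pseudo = pseudo_ground_truth(votes, min_votes=2)
expertise = estimate_expertise(votes, pseudo)
scores = weighted_scores(votes, expertise)
print(pseudo)
print(expertise)
print(scores)
```

In this sketch a model that often disagrees with the consensus receives a lower weight, so labels supported only by that model end up with low aggregated scores and are natural candidates for later filtering.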

Paper Structure

This paper contains 33 sections, 5 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Example CIFAR-100 test images with noisy labels. The text below each picture gives the original CIFAR-100 label (first row) and the cleaned label from Northcutt et al. (2021) (second row).
  • Figure 2: Example CIFAR-100 training images with multiple labels. The text below each picture gives the original CIFAR-100 label (first row) and the human-annotated supplementary label (second row). The supplementary labels are subjective and do not exhaust all possible labels.
  • Figure 3: REVEAL renovation pipeline. Both the VLM-based and human-annotated methods first assign labels to each image independently. These preliminary labels are then aggregated using a weighted voting ensembling strategy. To refine the results, a score threshold is applied to filter the aggregated labels, followed by a softmax operation to compute the corresponding likelihoods. This process ultimately yields a soft-labeled output suitable for downstream tasks.
  • Figure 4: Evaluation of different prompt settings. From left to right, the results are from Janus, LLaVA, and Qwen, respectively. Results are evaluated on the first 100 images of the CIFAR-100 test data. To balance running time and recall, the label batch size is set to 20.
  • Figure 5: Model pair comparison. The three sub-figures show confusion matrices of renovation results between pairs of VLMs: Janus/Qwen, Janus/LLaVA, and Qwen/LLaVA, respectively.
  • ...and 3 more figures
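
The last two steps in the Figure 3 caption, score thresholding followed by a softmax, can be sketched as below. This is only an illustration under assumptions: the function name `soft_labels`, the `>=` comparison, and the example scores and threshold are hypothetical and not taken from the paper.

```python
import math


def soft_labels(scores, threshold):
    """Filter aggregated label scores by a threshold, then apply a softmax.

    scores: {label: aggregated score} for one image.
    Returns {label: likelihood} over the labels that pass the threshold;
    the likelihoods sum to 1. Returns {} if no label passes.
    """
    kept = {label: s for label, s in scores.items() if s >= threshold}
    if not kept:
        return {}
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(kept.values())
    exps = {label: math.exp(s - top) for label, s in kept.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical aggregated scores for one image.
example = {"dog": 2.5, "ball": 2.0, "sofa": 0.5}
print(soft_labels(example, threshold=1.0))
```

Applying the softmax only after filtering means that low-scoring labels do not take probability mass away from the labels that survive the threshold.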