Table of Contents
Fetching ...

Do Vision Models Develop Human-Like Progressive Difficulty Understanding?

Zeyi Huang, Utkarsh Ojha, Yuyang Ji, Donghyun Lee, Yong Jae Lee

TL;DR

The study investigates whether vision models exhibit human-like progressive difficulty understanding by generating a synthetic, difficulty-annotated image dataset (36,000 images across 100 categories, 10 attributes, 3 difficulty levels) using GPT-4 and DALL-E 3. It introduces the Hierarchical Learning Score to quantify whether models respond to harder prompts only after mastering easier ones, and demonstrates that six popular architectures exhibit high adherence to this principle (roughly $>85%$ of triplets). A GRE-inspired adaptive testing framework is proposed to efficiently estimate overall performance using only a subset of images, achieving close alignment with full evaluation via a GRE-style score $Score = 1\cdot\text{correct}_{easy} + 2\cdot\text{correct}_{medium} + 4\cdot\text{correct}_{hard}$ and showing robust results across model types. Human-perception studies corroborate that generated difficulty correlates strongly with perceived difficulty, supporting the dataset’s validity. The work offers a practical, scalable approach to probing learning dynamics in vision models and enables faster, attribute-aware model evaluation through adaptive testing and synthetic data.

Abstract

When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question $(2 \times 3)$ incorrectly, they would likely answer a more difficult one $(2 \times 3 \times 4)$ incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibit a similarly structured learning capacity? In this work, we consider the task of image classification and study if those models' responses follow that pattern. Since real images aren't labeled with difficulty, we first create a dataset of 100 categories, 10 attributes, and 3 difficulty levels using recent generative models: for each category (e.g., dog) and attribute (e.g., occlusion), we generate images of increasing difficulty (e.g., a dog without occlusion, a dog only partly visible). We find that most of the models do in fact behave similarly to the aforementioned pattern around 80-90% of the time. Using this property, we then explore a new way to evaluate those models. Instead of testing the model on every possible test image, we create an adaptive test akin to GRE, in which the model's performance on the current round of images determines the test images in the next round. This allows the model to skip over questions too easy/hard for itself, and helps us get its overall performance in fewer steps.

Do Vision Models Develop Human-Like Progressive Difficulty Understanding?

TL;DR

The study investigates whether vision models exhibit human-like progressive difficulty understanding by generating a synthetic, difficulty-annotated image dataset (36,000 images across 100 categories, 10 attributes, 3 difficulty levels) using GPT-4 and DALL-E 3. It introduces the Hierarchical Learning Score to quantify whether models respond to harder prompts only after mastering easier ones, and demonstrates that six popular architectures exhibit high adherence to this principle (roughly of triplets). A GRE-inspired adaptive testing framework is proposed to efficiently estimate overall performance using only a subset of images, achieving close alignment with full evaluation via a GRE-style score and showing robust results across model types. Human-perception studies corroborate that generated difficulty correlates strongly with perceived difficulty, supporting the dataset’s validity. The work offers a practical, scalable approach to probing learning dynamics in vision models and enables faster, attribute-aware model evaluation through adaptive testing and synthetic data.

Abstract

When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question incorrectly, they would likely answer a more difficult one incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibit a similarly structured learning capacity? In this work, we consider the task of image classification and study if those models' responses follow that pattern. Since real images aren't labeled with difficulty, we first create a dataset of 100 categories, 10 attributes, and 3 difficulty levels using recent generative models: for each category (e.g., dog) and attribute (e.g., occlusion), we generate images of increasing difficulty (e.g., a dog without occlusion, a dog only partly visible). We find that most of the models do in fact behave similarly to the aforementioned pattern around 80-90% of the time. Using this property, we then explore a new way to evaluate those models. Instead of testing the model on every possible test image, we create an adaptive test akin to GRE, in which the model's performance on the current round of images determines the test images in the next round. This allows the model to skip over questions too easy/hard for itself, and helps us get its overall performance in fewer steps.

Paper Structure

This paper contains 28 sections, 1 equation, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Left: Sample images from our proposed test set generated using GPT-4 + DALL-E 3. For the class of golden retriever and attribute occlusion, we generate images of varying difficulty. Intuitively, it is easier to classify the leftmost image as golden retriever compared to rightmost image. Right: Possible responses (correct/incorrect) of a model on the easy/medium/hard image on the left side. Cannot solve easy one $\rightarrow$ cannot solve difficult one. Can solve difficult one $\rightarrow$ can solve easy one: this hypothesis is only satisfied in 4 (in red) out of 8 possibilities.
  • Figure 2: Visualizing the difficulty of test samples. All of the images are generated using our proposed pipeline. In each quadrant, we focus on one attribute (e.g., lighting, in the top left), and from left to right we show the images becoming progressively more difficult to be classified correctly.
  • Figure 3: Overview of the test set generation process. The first step is to collect the names of the image categories that we wish to test the models on. We then prompt GPT-4 to generate the appropriate attribute values for those categories with various levels of difficulty. Using those, we again prompt GPT-4 to generate text prompts for a category (golden retriever), attribute (heavy occlusion) combination. Finally, we use DALL-E 3 to generate the corresponding images.
  • Figure 4: Left: Plots depicting % of model's behavior on 12k triplets over the 8 possible patterns of Easy, Medium, Hard. The bars corresponding to principle-following pattern are colored green; others, red. All models behave according to the hierarchical learning principle. Top right: Hierarchical learning score of 6 vision models. Most achieve a score higher than $85\%$. Bottom right: Scatter plot of top-1 accuracy on our test set vs hierarchical learning score of 12 models. PCC value is 0.77.
  • Figure 5: % of our dataset grouped according to classification confidence for the Easy, Medium, and Hard difficulty levels. We average the sample numbers across six selected classifiers (ViT-B16 dosovitskiy2020image, ConvNext liu2022convnet, ResNet-101 he2016deep, trained on ImageNet1k deng2009imagenet and LAION schuhmann2022laion). See Appendix for more confidence visualization of different classifiers.
  • ...and 9 more figures