Mind the (Data) Gap: Evaluating Vision Systems in Small Data Applications

Samuel Stevens, S M Rayeed, Jenna Kline

TL;DR

This work argues that the small-data regime, where tens to thousands of labeled samples are available, is essential for real-world vision applications but underexplored in current AI research. Using the NeWT ecological benchmark, it compares multimodal large language models (MLLMs) against vision encoders paired with SVMs across varying training set sizes. MLLMs improve rapidly with the first few labeled examples but plateau after roughly 10–30 examples, while vision-only methods continue to improve with more data. The study provides the first systematic small-data comparison between these approaches, showing distinct scaling behaviors and highlighting how pre-training supervision shapes representations differently for fine-grained versus semantic tasks. The findings advocate for explicit small-data evaluations in AI benchmarks to better align research progress with deployment needs and practical constraints.

Abstract

The practical application of AI tools for specific computer vision tasks relies on the "small-data regime" of hundreds to thousands of labeled samples. This small-data regime is vital for applications requiring expensive expert annotations, such as ecological monitoring, medical diagnostics, or industrial quality control. We find, however, that computer vision research has ignored the small-data regime as evaluations increasingly focus on zero- and few-shot learning. We use the Natural World Tasks (NeWT) benchmark to compare multimodal large language models (MLLMs) and vision-only methods across varying training set sizes. MLLMs exhibit early performance plateaus, while vision-only methods improve throughout the small-data regime, with performance gaps widening beyond 10 training examples. We provide the first comprehensive comparison between these approaches in small-data contexts and advocate for explicit small-data evaluations in AI research to better bridge theoretical advances with practical deployments.

Paper Structure

This paper contains 28 sections, 4 figures, 18 tables.

Figures (4)

  • Figure 1: Left: Unique evaluation tasks used in recent language and vision research [google2024gemini1.5; radford2021clip; oquab2023dinov2; zhai2023siglip; anthropic2024sonnet35; microsoft2024phi4; tschannen2025siglip2; meta2024llama32; olmo2024olmo2; lambert2024tulu3; fini2024aimv2; bardes2024vjepa; google2025gemma3], summarized by the number of training samples per task. Note how few evaluations use between 10 and 1,000 labeled training samples. We collect this data manually. Right: Mean NeWT task performance as a function of the number of labeled examples for multimodal large language models (MLLMs) and vision-only models combined with support vector machines (SVMs). MLLMs use labeled examples by including them in the prompt (few-shot prompting). Vision models use labeled examples by fitting an SVM to frozen image embeddings. Vision models with SVMs improve with additional training data and consistently outperform MLLMs with 10 or more labeled samples. Note the log scale for training data. Shaded areas indicate bootstrapped 95% confidence intervals.
  • Figure 2: Performance scaling across NeWT's [van2021inat2021] eight task clusters as a function of the number of labeled examples. Each panel corresponds to one task cluster (species, attributes, health, ages, gestalt, context, counting, behavior; clusters contain more than one task). Lines depict representative multimodal large language models (MLLMs: Gemini Flash 2.0, Qwen2.5-VL 72B) and vision encoders (CLIP ViT-L/14, DINOv2 ViT-g/14, SigLIP ViT-SO400M/14). Shaded regions represent bootstrapped 95% confidence intervals. MLLMs exhibit early performance plateaus, whereas vision encoders combined with SVMs show sustained improvements as the number of labeled examples increases. Because SVMs cannot be fit without at least one labeled example per class, we simulate random chance for 0 and 1 labeled examples.
  • Figure 3: Left: Vision model performance with respect to inference FLOPs and number of labeled examples ($n$). SigLIP [zhai2023siglip] released eight different pre-trained transformers with varying model sizes (ViT-B/16, ViT-L/16, and ViT-SO400M/14) and image sizes ($224\times224$, $256\times256$, $384\times384$, and $512\times512$); we unify these axes with FLOPs/image. We find that increasing the number of labeled examples is more effective than increasing the model size; a $10\times$ increase in labeled examples outperforms a $10\times$ increase in FLOPs. Right: Comparing the effect of vision model pre-training on performance across the eight task clusters in NeWT for 30 labeled examples with ViT-L models. Black error bars indicate bootstrapped 95% confidence intervals. Vision-only pre-training [oquab2023dinov2] outperforms language-supervised pre-training [radford2021clip; zhai2023siglip] on 'Species' and 'Age' tasks, both of which are fine-grained classification tasks. We observe that language supervision leads to large improvements on 'Gestalt' and 'Behavior' tasks, both of which require semantic reasoning. These conclusions hold for other numbers of labeled examples; see the appendix on pre-training for additional results.
  • Figure 4: Comparing the effect of vision model pre-training on performance across the eight task clusters in NeWT for 3, 10, 30, and 100 training samples with ViT-L models.
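
The captions above report bootstrapped 95% confidence intervals but do not say which bootstrap variant was used. As a rough illustration only, the sketch below computes a percentile bootstrap over per-example correctness flags. The function name `bootstrap_ci`, the resample count, and the 70-out-of-100 example data are all hypothetical and are not taken from the paper.

```python
import random


def bootstrap_ci(correct, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for mean accuracy.

    `correct` is a list of 0/1 flags, one per test example.
    This is an assumed variant; the paper does not specify its procedure.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the test examples with replacement and record the accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    # Take the empirical quantiles at the two tails of the resampled means.
    alpha = (1.0 - level) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lo, hi


# Hypothetical per-example results: 70 correct out of 100 test images.
correct = [1] * 70 + [0] * 30
low, high = bootstrap_ci(correct)
accuracy = sum(correct) / len(correct)
print(f"accuracy = {accuracy:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Fixing the seed makes the interval reproducible across runs. With only a handful of test examples the interval becomes wide, so intervals matter most when comparing methods at small sample sizes.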