Mind the (Data) Gap: Evaluating Vision Systems in Small Data Applications
Samuel Stevens, S M Rayeed, Jenna Kline
TL;DR
This work argues that the small-data regime, where hundreds to thousands of labeled samples are available, is essential for real-world vision applications but underexplored in current AI research. Using the NeWT ecological benchmark, it compares multimodal large language models (MLLMs) against vision encoders paired with SVMs across varying training set sizes. MLLMs improve rapidly with the first few examples but plateau after roughly 10–30, while vision-only methods continue to improve as more data is added. The study provides the first systematic small-data comparison between these approaches, showing distinct scaling behaviors and highlighting how the type of pre-training supervision shapes representations differently for fine-grained versus semantic tasks. The findings advocate for explicit small-data evaluations in AI benchmarks to better align research progress with deployment needs and practical constraints.
Abstract
The practical application of AI tools for specific computer vision tasks relies on the "small-data regime" of hundreds to thousands of labeled samples. This small-data regime is vital for applications requiring expensive expert annotations, such as ecological monitoring, medical diagnostics, or industrial quality control. We find, however, that computer vision research has ignored the small-data regime as evaluations increasingly focus on zero- and few-shot learning. We use the Natural World Tasks (NeWT) benchmark to compare multimodal large language models (MLLMs) and vision-only methods across varying training set sizes. MLLMs exhibit early performance plateaus, while vision-only methods improve throughout the small-data regime, with performance gaps widening beyond 10 training examples. We provide the first comprehensive comparison between these approaches in small-data contexts and advocate for explicit small-data evaluations in AI research to better bridge theoretical advances with practical deployments.
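The evaluation protocol described above, fitting a lightweight classifier on frozen features at increasing training set sizes and scoring each on a fixed test set, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' code: the features are synthetic Gaussian vectors standing in for vision-encoder embeddings, a nearest-centroid classifier stands in for the SVM so that the example needs only the standard library, and the training set sizes and all function names are hypothetical.

```python
import random
from statistics import mean


def make_features(n, dim=16, shift=0.6, seed=0):
    """Synthetic stand-in for frozen encoder features on a binary task."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so classes are balanced
        center = shift if label == 1 else -shift
        features = [rng.gauss(center, 1.0) for _ in range(dim)]
        data.append((features, label))
    return data


def fit_centroids(train):
    """Nearest-centroid classifier (an assumed stand-in for the SVM)."""
    centroids = {}
    for label in sorted({y for _, y in train}):
        rows = [x for x, y in train if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sq_dist)


def accuracy(centroids, test):
    """Fraction of test examples classified correctly."""
    return mean(1.0 if predict(centroids, x) == y else 0.0 for x, y in test)


def learning_curve(pool, test, sizes, seed=0):
    """Accuracy on a fixed test set for each training set size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in pool:
        by_class.setdefault(y, []).append((x, y))
    curve = {}
    for n in sizes:
        # Stratified subsample: equal number of examples per class.
        per_class = n // len(by_class)
        train = []
        for label in sorted(by_class):
            train.extend(rng.sample(by_class[label], per_class))
        curve[n] = accuracy(fit_centroids(train), test)
    return curve


if __name__ == "__main__":
    pool = make_features(600, seed=1)
    test_set = make_features(400, seed=2)
    # Hypothetical sizes spanning few-shot to small-data settings.
    for n, acc in learning_curve(pool, test_set, [2, 10, 30, 100, 300]).items():
        print(f"n_train={n:4d}  accuracy={acc:.3f}")
```

Keeping the test set fixed while only the training subsample grows is what makes the accuracies at different sizes comparable. In a real study the synthetic features would be replaced by embeddings from a pre-trained vision encoder, the classifier by an SVM, and each size would typically be repeated over several random subsamples to estimate variance.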
