UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

Haider Al-Tahan, Quentin Garrido, Randall Balestriero, Diane Bouchacourt, Caner Hazirbas, Mark Ibrahim

TL;DR

UniBench provides a unified benchmarking framework for vision-language models by aggregating 53 benchmarks and evaluating nearly 60 models, revealing that scaling data or parameters improves many tasks but offers limited gains for reasoning and relational capabilities. The work shows surprising weaknesses on simple digit recognition and counting tasks, underscoring the importance of data quality and tailored learning objectives over sheer scale. It proposes practical guidance for model selection and introduces a distilled, fast subset of benchmarks enabling rapid 5-minute evaluations on a single GPU. The open-source UniBench codebase aims to standardize, accelerate, and broaden comprehensive VLM evaluation, reducing blind spots and guiding progress.

Abstract

Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover that today's best VLMs struggle on simple digit recognition and counting tasks, e.g., MNIST, which much simpler networks can solve. Where scale falls short, we find that more precise interventions, such as data quality or tailored learning objectives, offer more promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application. Finally, we release an easy-to-run UniBench codebase with the full set of 50+ benchmarks and comparisons across 59 models, as well as a distilled, representative set of benchmarks that runs in 5 minutes on a single GPU.

Paper Structure (32 sections, 12 figures, 6 tables)

This paper contains 32 sections, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Benchmark Types in UniBench with their respective performance gains from scaling model size and training dataset size. Scale offers limited benefits for relational understanding and reasoning tasks. Example images were acquired from https://unsplash.com/
  • Figure 2: Median performance of all 59 VLMs on 53 benchmarks, illustrating that, despite many advances, VLMs still struggle on several benchmarks. Several benchmarks, such as Winoground, iNaturalist, DSPR, Small Norb, dmlab, Clevr, PCam, Rendered SST2, and Kitti, hardly exceed chance-level performance. Blue bars represent the median zero-shot performance of the models; grey bars indicate chance-level performance for each benchmark.
  • Figure 3: The effect of scaling model and training dataset size using a fixed architecture and learning paradigm. Zero-shot performance of models on various benchmark types. We investigate the impact of training dataset size (left) and model size (right) on various benchmark types. To isolate the effect of scale, we fix the architecture, learning paradigm, model size (for the left panel), and training dataset size (for the right panel) by using the same CLIP ViT-B/32 model and LAION 400M dataset, respectively. We observe a similar trend when measured across all 59 models, as shown in the Appendix (label: fig:results_summary_dataset_type_all).
  • Figure 4: The effect of scaling training dataset size (left) and model size (right) across capabilities for all models. Accuracy is the difference in performance between the most scaled and the least scaled model across capabilities relative to ImageNet performance.
  • Figure 5: Performance of 59 VLMs on MNIST, showing that, despite progress, VLMs still struggle on MNIST. Blue bars represent the zero-shot performance of the models, the grey bar represents the chance level for MNIST, and the green bar shows the performance of a 2-layer MLP.
  • ...and 7 more figures
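
The comparison described in the Figure 2 caption (median zero-shot performance across models against a chance-level baseline for each benchmark) can be sketched as follows. This is a minimal illustration, not the UniBench implementation: the function names, the `margin` threshold, the benchmark and model names, and all accuracy values are hypothetical, and chance level is assumed to be 1/k for a balanced k-class benchmark, which does not hold for every benchmark type.

```python
from statistics import median


def chance_level(num_classes: int) -> float:
    """Accuracy of uniform random guessing on a balanced k-class task (assumption)."""
    return 1.0 / num_classes


def summarize(scores, num_classes, margin=0.05):
    """Return the median accuracy per benchmark and whether it is near chance.

    scores:      {benchmark: {model: accuracy}}
    num_classes: {benchmark: number of classes}
    margin:      hypothetical threshold; a median within `margin` of chance
                 is flagged as "near chance"
    """
    summary = {}
    for bench, per_model in scores.items():
        med = median(per_model.values())
        chance = chance_level(num_classes[bench])
        summary[bench] = {
            "median": med,
            "chance": chance,
            "near_chance": (med - chance) <= margin,
        }
    return summary


# Made-up accuracies for illustration only; these are NOT results from the paper.
scores = {
    "benchmark_a": {"model_1": 0.62, "model_2": 0.71, "model_3": 0.68},
    "benchmark_b": {"model_1": 0.11, "model_2": 0.13, "model_3": 0.09},
}
num_classes = {"benchmark_a": 100, "benchmark_b": 10}

summary = summarize(scores, num_classes)
for bench, row in summary.items():
    print(bench, row)
```

Using the median rather than the mean keeps a single very strong or very weak model from dominating a benchmark's summary, which matters when the set of evaluated models is heterogeneous.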