AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Arpita Chowdhury, Zheda Mai, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Wei-Lun Chao

Abstract

The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction-tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields VFM rankings similar to those from a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.
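One way to make the claim about similar VFM rankings under a 0.5B and a 7B LLM concrete is a rank correlation between the two sets of per-VFM scores. The sketch below is illustrative only: the abstract does not say which agreement measure the authors use, so Spearman correlation is an assumption here, and the model names (`vfm_a`, ...) and scores are made-up placeholders, not results from the paper.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank models from best (1) to worst; tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    result: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of models tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    names = sorted(a)
    if names != sorted(b):
        raise ValueError("both score tables must cover the same models")
    ra, rb = ranks(a), ranks(b)
    x = [ra[n] for n in names]
    y = [rb[n] for n in names]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when all ranks are tied")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical placeholder scores for one AVA under two LLM heads (not from the paper).
small_head = {"vfm_a": 0.62, "vfm_b": 0.55, "vfm_c": 0.71, "vfm_d": 0.48}
large_head = {"vfm_a": 0.70, "vfm_b": 0.66, "vfm_c": 0.78, "vfm_d": 0.52}

rho = spearman(small_head, large_head)
print(f"Spearman rank correlation: {rho:.2f}")
```

A value near 1 means the two heads order the VFMs almost identically even when the absolute scores differ, which is the property that would justify evaluating with the cheaper head.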

Paper Structure

This paper contains 48 sections, 3 equations, 27 figures, 8 tables.

Figures (27)

  • Figure 1: Vision foundation models (VFMs) trained with different data and objectives are evaluated on the proposed AVA-Bench to assess their strengths and limitations across atomic visual abilities (AVAs).
  • Figure 2: Visual Question Answering (VQA) often requires multiple atomic visual abilities to answer a question. When a model makes an incorrect prediction, it's hard to determine whether it stems from a failure to capture all required AVAs or just a single critical one.
  • Figure 3: AVA-Bench consists of 14 Atomic Visual Abilities that can be combined to address more complex visual reasoning tasks.
  • Figure 4: (a) Evaluation pipeline for AVA-Bench: The standard LLaVA-style two-stage training prepares the connector and LLM for VFM evaluation. For each AVA, only the connector and LoRA are trained. (b) Overall statistics of AVA-Bench.
  • Figure 5: Performance comparison of VFMs across all AVAs. (Left) Language-supervised VFMs with DINOv2 as a reference. (Right) Other VFMs with SigLIP-2 as a reference.
  • ...and 22 more figures