Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models
Teppei Suzuki, Keisuke Ozawa
TL;DR
This paper addresses the high cost of evaluating VLMs across many benchmarks by introducing ResampledBench, a subset built with farthest point sampling (FPS) that preserves model rankings, reaching a correlation above $0.96$ with full benchmark evaluations while using about $1\%$ of the data. The method operates in a joint image-text feature space, and the experiments show that no single benchmark fully covers the evaluation space, while FPS-based sampling reduces redundancy and can mitigate dataset biases. The approach yields practical efficiency gains (≈100×) and, when used to filter an established benchmark such as MMStar, improves bias mitigation. Overall, it provides a scalable framework for robust, bias-aware VLM evaluation and points to future work on learning a true multimodal embedding space to further improve benchmarking.
Abstract
We propose an efficient evaluation protocol for large vision-language models (VLMs). Given their broad knowledge and reasoning capabilities, multiple benchmarks are needed for comprehensive assessment, making evaluation computationally expensive. To improve efficiency, we construct a subset that yields results comparable to full benchmark evaluations. Our benchmark classification experiments reveal that no single benchmark fully covers all challenges. We then introduce a subset construction method using farthest point sampling (FPS). Our experiments show that FPS-based benchmarks maintain a strong correlation (> 0.96) with full evaluations while using only ~1\% of the data. Additionally, applying FPS to an existing benchmark improves correlation with overall evaluation results, suggesting its potential to reduce unintended dataset biases.
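The subset construction named above can be illustrated with a minimal sketch of greedy farthest point sampling. It is not the authors' implementation: the function name `farthest_point_sampling`, the Euclidean distance, the random choice of the first point, and the assumption that each benchmark sample has already been mapped to a feature vector are all illustrative choices made here.

```python
import math
import random


def farthest_point_sampling(features, k, seed=0):
    """Greedily pick k indices so that each new point is as far as possible
    from the points already selected.

    Assumptions (not taken from the paper): `features` is a sequence of
    equal-length numeric vectors, distance is Euclidean, and the first
    point is chosen at random.
    """
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of samples")

    rng = random.Random(seed)
    first = rng.randrange(n)
    selected = [first]

    # min_dist[i] holds the distance from sample i to its nearest selected sample.
    min_dist = [math.dist(f, features[first]) for f in features]

    for _ in range(k - 1):
        # The next pick is the sample farthest from the current selection.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-selected distances with the newly added sample.
        for i, f in enumerate(features):
            d = math.dist(f, features[nxt])
            if d < min_dist[i]:
                min_dist[i] = d

    return selected


# Toy usage: two tight clusters and one isolated point.
toy_features = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (5.0, 5.0)]
subset = farthest_point_sampling(toy_features, k=3)
```

On this toy input the three selected indices cover both clusters and the isolated point, which reflects the intuition behind using FPS to reduce redundancy: near-duplicate samples are unlikely to be picked together while more distant regions of the feature space remain uncovered.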
