OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

Ling Lin; Yang Bai; Heng Su; Congcong Zhu; Yaoxing Wang; Yang Zhou; Huazhu Fu; Jingrun Chen

TL;DR

This paper proposes OODBench, a predominantly automated method with minimal human verification for constructing new benchmarks and evaluating the ability of VLMs to process OOD data, together with a reliable automated assessment metric that uses a Basic-to-Advanced Progression of prompted questions to assess more fully the impact of OOD data on questions of varying difficulty.

Abstract

Existing Vision-Language Models (VLMs) have achieved significant progress by training on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). In real-world scenarios, however, it is often impractical to expect all data processed by an AI system to satisfy this assumption. Moreover, failure to handle out-of-distribution (OOD) objects appropriately may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Current research has not yet provided valid benchmarks that comprehensively assess how VLMs perform on OOD data. We therefore propose OODBench, a predominantly automated method with minimal human verification for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that uses a Basic-to-Advanced Progression of prompted questions to assess more fully the impact of OOD data on questions of varying difficulty. Lastly, we summarize substantial findings and insights to facilitate future research on the acquisition and evaluation of OOD data.

Paper Structure (35 sections, 37 equations, 21 figures, 18 tables, 1 algorithm)

Introduction
Related Work
Existing Definitions and Formulations
The OODBench
Experiment
Selected Datasets
Evaluation Setup
Main Results Analysis
Does Chain-of-Thought Reasoning Work on OOD Data?
BAP Evaluation of ID and OOD Data Differences
Analysis of Error Cases
Conclusion
Impact Statements
Collection Details
Statistical Evidence for the OOD Nature of OODBench in Modern MLLMs
...and 20 more sections

Figures (21)

Figure 1: Comparison of ID data, covariate-shift OOD data, and semantic-shift data.
Figure 2: Distribution of categories and fields in OODBench.
Figure 3: Pipeline of OOD data collection.
Figure 4: Example of the Basic-to-Advanced Progression metric. From top to bottom, the metric covers three question types: existential, counting, and logical reasoning.
Figure 5: An OODBench example pairs one instance with two questions whose answers alternate, so that a model with a biased output distribution is not over-scored.
...and 16 more figures
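
The answer alternation described in the Figure 5 caption can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data layout, the example questions, and the per-question accuracy are hypothetical and are not taken from the paper, whose actual scoring may differ. The sketch only shows why pairing a "yes" question with a "no" question limits the score of a model that always gives the same answer.

```python
from typing import Callable, List, Tuple

# Hypothetical layout (not from the paper): each instance carries two yes/no
# questions whose ground-truth answers differ, one "yes" and one "no".
QA = Tuple[str, str]          # (question, ground-truth answer)
Pair = Tuple[QA, QA]          # the two questions attached to one instance


def question_accuracy(pairs: List[Pair], model: Callable[[str], str]) -> float:
    """Return the fraction of individual questions the model answers correctly."""
    total = 0
    correct = 0
    for pair in pairs:
        for question, answer in pair:
            total += 1
            prediction = model(question).strip().lower()
            correct += int(prediction == answer)
    return correct / total if total else 0.0


def always_yes(question: str) -> str:
    """A maximally biased stand-in model that ignores its input."""
    return "yes"


# Invented example questions, used only to exercise the function.
pairs: List[Pair] = [
    (("Is there a dog in the image?", "yes"), ("Is there a cat in the image?", "no")),
    (("Is there a car in the image?", "yes"), ("Is there a bus in the image?", "no")),
]

biased_score = question_accuracy(pairs, always_yes)
print(biased_score)  # 0.5
```

Because each instance contributes one "yes" and one "no" ground truth, a constant-answer model is held to half of the questions in this sketch, whichever answer it favors.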
