Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Neelabh Sinha; Vinija Jain; Aman Chadha

TL;DR

The paper addresses the problem of selecting an appropriate Vision-Language Model (VLM) for Visual Question-Answering (VQA) across diverse tasks and domains. It introduces an end-to-end framework that combines the VQA360 dataset with GoEval, a multimodal metric built on GPT-4o. VQA360 is derived from established benchmarks and is annotated with task types, application domains, and knowledge types, while GoEval aligns more closely with human judgments (56.71% correlation) than traditional metrics. Experiments across 10 variants of 8 VLMs reveal no universal winner: the proprietary Gemini-1.5-Pro and GPT-4o-mini excel in different areas, whereas the open-source InternVL-2-8B and CogVLM-2-Llama-3-19B offer competitive strengths and practical benefits. The framework supports task-specific VLM selection and can be extended to other multimodal tasks.

Abstract

Visual Question-Answering (VQA) has become key to user experience, particularly since the generalization capabilities of Vision-Language Models (VLMs) have improved. However, evaluating VLMs against an application's requirements with a standardized framework in practical settings remains challenging. This paper addresses that gap with an end-to-end framework. We present VQA360, a novel dataset derived from established VQA benchmarks and annotated with task types, application domains, and knowledge types, for comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, which achieves a correlation factor of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, making the right choice a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive strengths while providing additional advantages. Our framework can also be extended to other tasks.

Paper Structure (17 sections, 6 figures, 11 tables)

This paper contains 17 sections, 6 figures, 11 tables.

Introduction
VQA360: A Practical Evaluation Dataset
Source Datasets
Instance Tags
Generating Instance Tags
GoEval: A VQA Evaluation Metric
Validating GoEval
Comparative Evaluation of VLMs
Correlation Between VLM Outputs
Evaluation on Task Types
Evaluation on Application Domains
Evaluation on Knowledge Types
Overall Analysis and Recommendations
Conclusion
Appendix
...and 2 more sections

Figures (6)

Figure 1: Examples of VQA360 tasks and their labels for task type, application domain, and knowledge type in our dataset.
Figure 2: An example showing the steps taken to generate the instance tags for application domain and knowledge type (task type is mapped directly from the dataset from which the image is taken).
Figure 3: Number of task instances per application domain (left) and knowledge type (right) after generating the instance tags using GPT-3.5-Turbo. All categories are represented by approximately 400 instances, which is sufficient for a representative analysis. Categories with fewer than 300 instances are filtered out, and a task instance can be tagged with multiple categories of a single aspect.
Figure 4: Correlation of GoEval-mini values between different VLMs over all task instances. The low correlation between the outputs of every pair of models indicates that different VLMs perform differently on the same task instances.
Figure 5: Mean GoEval-mini scores for different application domains for all VLMs. Gemini-1.5-Pro and GPT-4o-mini are the best-performing closed models, with CogVLM-2-Llama-3-19B and InternVL-2-8B performing best among open models.
...and 1 more figure
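
The captions for Figures 3 and 5 describe two steps that can be sketched together: dropping sparsely populated categories (a task instance may carry several tags for one aspect) and averaging per-model scores within each remaining category. The following is a minimal sketch only, not the authors' implementation. The record layout (`tags`, `scores`), the function name `mean_score_by_category`, the model and category names, and the toy score values are all hypothetical; the default threshold of 300 follows the Figure 3 caption.

```python
from collections import defaultdict


def mean_score_by_category(instances, min_instances=300):
    """Return {category: {model: mean score}} for sufficiently large categories.

    `instances` is a list of dicts with (hypothetical) keys:
      "tags":   list of category names for one aspect (multi-label allowed)
      "scores": dict mapping model name -> metric score for this instance
    """
    # Count how many instances carry each tag (an instance counts once per tag).
    counts = defaultdict(int)
    for inst in instances:
        for tag in set(inst["tags"]):
            counts[tag] += 1

    # Keep only categories that meet the minimum instance count.
    kept = {tag for tag, n in counts.items() if n >= min_instances}

    # Accumulate per-category, per-model score sums and counts.
    sums = defaultdict(lambda: defaultdict(float))
    ns = defaultdict(lambda: defaultdict(int))
    for inst in instances:
        for tag in set(inst["tags"]) & kept:
            for model, score in inst["scores"].items():
                sums[tag][model] += score
                ns[tag][model] += 1

    return {
        tag: {model: sums[tag][model] / ns[tag][model] for model in sums[tag]}
        for tag in kept
    }


# Toy data with a lowered threshold so the filter is visible (values are made up).
toy = [
    {"tags": ["nature", "sports"], "scores": {"model_a": 1.0, "model_b": 0.0}},
    {"tags": ["nature"], "scores": {"model_a": 0.0, "model_b": 1.0}},
    {"tags": ["sports", "nature"], "scores": {"model_a": 1.0, "model_b": 1.0}},
    {"tags": ["food"], "scores": {"model_a": 1.0, "model_b": 1.0}},
]
result = mean_score_by_category(toy, min_instances=2)
```

In this toy run the "food" category has a single instance and is filtered out, while "nature" and "sports" are kept. Because an instance contributes to every category it is tagged with, per-category means are not independent of one another.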
