How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?

Saeid Asgari Taghanaki; Joseph Lambourne; Alana Mongkhounsavath

TL;DR

We propose a generalizable methodology for identifying the image distribution that a black-box Vision-Language Model (VLM) prefers, by measuring how consistent its outputs remain across varied input prompts. The methodology provides a framework for advancing VLM capabilities in complex visual reasoning tasks across fields that require expert-level visual interpretation.

Abstract

Large foundation models have revolutionized the field, yet challenges remain in optimizing multi-modal models for specialized visual tasks. We propose a novel, generalizable methodology to identify preferred image distributions for black-box Vision-Language Models (VLMs) by measuring output consistency across varied input prompts. Applying this methodology to different rendering types of 3D objects, we demonstrate its efficacy across various domains requiring precise interpretation of complex structures, with a focus on Computer-Aided Design (CAD) as an exemplar field. We further refine VLM outputs using in-context learning with human feedback, significantly enhancing explanation quality. To address the lack of benchmarks in specialized domains, we introduce CAD-VQA, a new dataset for evaluating VLMs on CAD-related visual question answering tasks. Our evaluation of state-of-the-art VLMs on CAD-VQA establishes baseline performance levels, providing a framework for advancing VLM capabilities in complex visual reasoning tasks across various fields requiring expert-level visual interpretation. We release the dataset and evaluation code at https://github.com/asgsaeid/cad_vqa.
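
The consistency idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pairwise-agreement scoring, and the `query_vlm` and `agree` callables are all assumptions introduced here for illustration. In practice `query_vlm` would wrap a call to the black-box VLM and `agree` would wrap whatever judgment decides that two answers are consistent.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, List


def consistency_score(answers: List[str], agree: Callable[[str, str], bool]) -> float:
    """Fraction of answer pairs that the (hypothetical) judge calls consistent."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree with
    return mean(1.0 if agree(a, b) else 0.0 for a, b in pairs)


def score_distributions(
    images_by_distribution: Dict[str, List[str]],
    prompts: List[str],
    query_vlm: Callable[[str, str], str],
    agree: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Average per-image consistency for each image distribution.

    Assumes every distribution has at least one image. `query_vlm(image, prompt)`
    and `agree(a, b)` are placeholders supplied by the caller.
    """
    scores = {}
    for name, images in images_by_distribution.items():
        per_image = [
            consistency_score([query_vlm(image, p) for p in prompts], agree)
            for image in images
        ]
        scores[name] = mean(per_image)
    return scores


if __name__ == "__main__":
    # Toy stand-ins: the fake model answers stably on "A" images and
    # echoes the prompt (so answers diverge) on everything else.
    def fake_vlm(image: str, prompt: str) -> str:
        return "a bracket" if image.startswith("A") else prompt

    def exact_match(a: str, b: str) -> bool:
        return a == b

    data = {"A": ["A_obj1", "A_obj2"], "B": ["B_obj1", "B_obj2"]}
    prompts = ["Describe the part.", "What is shown?", "Explain this object."]
    scores = score_distributions(data, prompts, fake_vlm, exact_match)
    print(max(scores, key=scores.get), scores)
```

Under these assumptions, the distribution with the highest average score is the one for which the model's answers change least when the prompt is reworded.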

Paper Structure (16 sections, 5 equations, 2 figures, 4 tables)

Introduction
Related Work
Method
Measuring Consistency
Human Expert Rating and Dataset Creation
CAD-VQA Dataset
Dataset Creation Process
Results
In-Context Learning with Human Feedback
Performance of State-of-the-Art VLMs on our CAD-VQA dataset
Conclusion
Supplementary Material
Paraphrasing Prompt
Consistency Judgment Prompt
In-Context Learning with Human Feedback Prompt
...and 1 more sections

Figures (2)

Figure 1: Sample data visualization showing different image distributions and generated explanations. First, second and third row correspond to distributions A, B, and C of the same object, respectively.
Figure 2: Visual comparison of algorithms for processing large datasets
