CAST: Cross-modal Alignment Similarity Test for Vision Language Models

Gautier Dagan; Olga Loginova; Anil Batra

TL;DR

The paper proposes a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities, and argues that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.

Abstract

Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks, which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. The test asks a model to identify similarities between two scenes given text-only, image-only, or combined inputs, and then to assess the truthfulness of the similarities it generated. Since there is no ground truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
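
The two-step procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` and `evaluate` are hypothetical callables standing in for VLM calls, the modality labels are chosen here for readability, and scoring self-consistency as the fraction of generated statements that the same model later accepts is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# Modality labels chosen for this sketch (image-only, text-only, image+text).
MODALITIES = ("image", "text", "both")

# Hypothetical interfaces standing in for calls to a VLM:
#   generate(modality) -> list of similarity statements for one scene pair
#   evaluate(statement, modality) -> True if the model judges the statement true
Generate = Callable[[str], List[str]]
Evaluate = Callable[[str, str], bool]


def cast_consistency(generate: Generate, evaluate: Evaluate) -> Dict[Tuple[str, str], float]:
    """Return, for each (generation modality, evaluation modality) pair,
    the fraction of generated statements the model accepts on evaluation."""
    scores: Dict[Tuple[str, str], float] = {}
    for gen_mod in MODALITIES:
        statements = generate(gen_mod)
        if not statements:
            continue  # nothing to evaluate for this generation modality
        for eval_mod in MODALITIES:
            accepted = sum(evaluate(s, eval_mod) for s in statements)
            scores[(gen_mod, eval_mod)] = accepted / len(statements)
    return scores


# Toy stand-in model, used only to show the shape of the output.
def toy_generate(modality: str) -> List[str]:
    outputs = {
        "image": ["Both scenes are outdoors", "Both scenes show a dog"],
        "text": ["Both scenes involve people"],
        "both": ["Both scenes are outdoors", "Both scenes show a dog",
                 "Both scenes involve people"],
    }
    return outputs[modality]


def toy_evaluate(statement: str, modality: str) -> bool:
    # The toy text-only evaluator rejects any statement mentioning a dog.
    return not (modality == "text" and "dog" in statement)


toy_scores = cast_consistency(toy_generate, toy_evaluate)
```

Diagonal entries such as `("image", "image")` correspond to consistency within a modality, and off-diagonal entries such as `("image", "text")` to consistency across modalities. A score below 1 means the model rejected some of its own statements.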

Paper Structure (22 sections, 4 equations, 12 figures, 3 tables)

Introduction
Related Works
Method
Generating Similarities
Evaluating Similarities
Experiments and Results
Dataset
Models
Results
Conclusion
Limitations
Ethical Considerations
Prompts
Generation
Evaluation
...and 7 more sections

Figures (12)

Figure 1: Example of paired scenes and statements from the CAST dataset. Horizontal blocks show generated statements, while vertical blocks are evaluations for each modality: image-only, text-only, and image+text. Red crosses indicate where each model disagrees with its own generation during the evaluation step. Similarity topics are highlighted in bold. Note that VLMs may produce hallucinations, as the CAST method checks for consistency rather than correctness.
Figure 2: CAST is two-fold. In the first step, we ask the model to generate a set of similarity statements conditioned on different modality input types (image-only, text-only, both). In the second step, the model validates the truthfulness of the generated statements with respect to each modality. This allows us to measure whether the VLM is self-consistent within a modality and across different modalities.
Figure 3: Average CAST self-consistency when multiple statements are generated and evaluated within the same modality. Left: Top-1 considers only the first statement generated. Right: Top-3 considers the first three statements generated; these are equivalent to the bolded results in the results table.
Figure 4: Generation Prompt: For each model and each of the three modalities, we generate a list of similarity statements using the above prompt.
Figure 5: Evaluation Prompts: For each model and each of the three modalities, we validate a similarity statement from the generation step. We use three different evaluation prompts to reduce potential bias of models towards a particular prompt format.
...and 7 more figures
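
The Top-1 and Top-3 settings named in the Figure 3 caption can be illustrated with a small helper. This is a sketch under assumptions: the input is, for each scene pair, the ordered list of verdicts (whether the model accepted each of its own statements, in generation order), and averaging first within a pair and then across pairs is an assumed aggregation, not a formula taken from the paper. The function name `top_k_consistency` is hypothetical.

```python
from typing import List, Sequence


def top_k_consistency(verdicts_per_pair: Sequence[Sequence[bool]], k: int) -> float:
    """Average acceptance rate over the first k statements of each scene pair."""
    if k < 1:
        raise ValueError("k must be at least 1")
    per_pair: List[float] = []
    for verdicts in verdicts_per_pair:
        top = list(verdicts[:k])  # keep only the first k generated statements
        if top:
            per_pair.append(sum(top) / len(top))
    if not per_pair:
        raise ValueError("no verdicts to aggregate")
    return sum(per_pair) / len(per_pair)


# Two toy scene pairs, three statements each (True = model agrees with itself).
example = [[True, False, True], [True, True, True]]
top1 = top_k_consistency(example, 1)
top3 = top_k_consistency(example, 3)
```

In this toy input the first statement is accepted for both pairs, so the Top-1 score is higher than the Top-3 score, which also counts the rejected second statement of the first pair.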