Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison

Qian Yang; Weixiang Yan; Aishwarya Agrawal

TL;DR

Decompose and Compare Consistency (DeCC) measures the reliability of a VLM's answer by comparing the direct answer, generated using the VLM's internal reasoning process, with indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM.

Abstract

Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate and may generate biased responses. In such circumstances, a way to assess the reliability of a given VLM response is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these issues, we propose Decompose and Compare Consistency (DeCC) for reliability measurement. DeCC measures the reliability of the VLM's direct answer by comparing the consistency between the direct answer, generated using the VLM's internal reasoning process, and the indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM. Experiments across six vision-language tasks with three VLMs show that DeCC's reliability estimation achieves better correlation with task accuracy than existing methods.
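The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the four callables (`answer_directly`, `decompose`, `reason_over`, `is_consistent`) are hypothetical stand-ins for model calls, the image input is omitted, and the toy stubs at the bottom only exist to make the sketch runnable.

```python
from typing import Callable, Dict, List, Tuple

SubQA = Tuple[str, str]  # (sub-question, sub-answer)


def decc_single_agent(
    question: str,
    answer_directly: Callable[[str], str],          # hypothetical: the candidate VLM
    decompose: Callable[[str], List[str]],          # hypothetical: question -> sub-questions
    reason_over: Callable[[str, List[SubQA]], str], # hypothetical: reason over sub-QA pairs
    is_consistent: Callable[[str, str], bool],      # hypothetical: answer comparison
) -> bool:
    """Return True if the direct answer is judged reliable."""
    # Direct answer from the VLM's own internal reasoning.
    direct = answer_directly(question)
    # Indirect answer: decompose, answer each sub-question, then reason over the pairs.
    sub_qa = [(sq, answer_directly(sq)) for sq in decompose(question)]
    reasoned = reason_over(question, sub_qa)
    # The direct answer is treated as reliable when both routes agree.
    return is_consistent(direct, reasoned)


# Toy stubs (assumptions for illustration only, no real model involved).
def make_toy_vlm(answers: Dict[str, str]) -> Callable[[str], str]:
    return lambda q: answers.get(q, "unknown")


def toy_decompose(question: str) -> List[str]:
    return ["Is the person wearing skis?", "Is there snow on the ground?"]


def toy_reason_over(question: str, sub_qa: List[SubQA]) -> str:
    # Answers "yes" only if every sub-answer is "yes".
    return "yes" if all(a == "yes" for _, a in sub_qa) else "no"


def toy_is_consistent(a: str, b: str) -> bool:
    # Exact match after normalization; a real comparison may be looser.
    return a.strip().lower() == b.strip().lower()
```

With the toy stubs, a VLM whose sub-answers support its direct answer is marked reliable, while one whose sub-answers contradict it is marked unreliable.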

Paper Structure (21 sections, 3 equations, 5 figures, 9 tables)

Introduction
Related Work
Method
Task Decomposition
Consistency Comparison
Experiments
Datasets
Evaluation Metric
Existing Methods Used for Comparison
Main Results
Further Analysis
Decoding Strategy
Question Type Analysis
Additional Analysis
Conclusion
...and 6 more sections

Figures (5)

Figure 1: DeCC begins by decomposing the question into multiple sub-questions. The candidate VLM answers these sub-questions, creating sub-QA pairs. Both the candidate VLM and an LLM independently reason over these pairs to derive reasoned answers. We then compare the direct answer with the reasoned answers to assess reliability. We also explore how different consistency comparison settings impact DeCC's effectiveness.
Figure 2: Illustration of Multi-Agent Consistency Comparison. Top: When both agents' reasoned answers are either consistent or inconsistent with the VLM's direct answer, we directly determine the reliability. Bottom: If the consistency check results contradict each other, we proceed to second-iteration consistency checks.
Figure 3: Example of the consistent situation. All answers are consistent, so we mark the direct answer as reliable.
Figure 4: Example of the inconsistent situation. The VLM's reasoned answer is consistent with the direct answer, while the LLM's reasoned answer is inconsistent. Neither agent changes its consistency check result. We trust the LLM's consistency check result and mark the direct answer as unreliable.
Figure 5: Example of the inconsistent situation. All answers are inconsistent and none of them is correct, indicating that the VLM does not understand the question well. We mark the direct answer as unreliable.
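
The decision flow suggested by the captions of Figures 2 and 4 can be sketched as below. This is a hedged reading of the captions only: `second_round` is a hypothetical callable standing in for the second-iteration consistency checks (whose details are not given here), and treating "trust the LLM" as the general tie-break is an assumption drawn from the single example in Figure 4.

```python
from typing import Callable, Tuple


def multi_agent_reliability(
    vlm_consistent: bool,
    llm_consistent: bool,
    second_round: Callable[[], Tuple[bool, bool]],  # hypothetical second-iteration checks
) -> bool:
    """Return True if the VLM's direct answer is judged reliable.

    vlm_consistent / llm_consistent: whether each agent's reasoned answer
    agrees with the VLM's direct answer in the first iteration.
    """
    # Figure 2, top: both agents agree, so reliability is decided directly.
    if vlm_consistent == llm_consistent:
        return vlm_consistent
    # Figure 2, bottom: the checks contradict, so run a second iteration.
    vlm_second, llm_second = second_round()
    if vlm_second == llm_second:
        return vlm_second
    # Figure 4: neither agent changed its result; trust the LLM's check
    # (assumed here to be the general tie-break).
    return llm_second
```

For example, if the VLM's reasoned answer stays consistent and the LLM's stays inconsistent across both iterations, the function returns `False`, matching the outcome described for Figure 4.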
