Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Zaid Khan, Yun Fu

TL;DR

This work tackles the reliability of API-only black-box vision-language models for visual question answering by introducing neighborhood consistency across rephrasings generated by a probing visual question generator. By evaluating consistency of the black-box model's answers over rephrasings conditioned on the same image, the approach uncovers a reliable uncertainty signal that differs from raw confidence, improving selective abstention on in-distribution, out-of-distribution, and adversarial data. The study demonstrates that higher consistency correlates with higher accuracy and yields better risk-coverage trade-offs, even when the rephraser is significantly smaller than the black-box model. This method enables safer, more reliable deployment of large VLM APIs by providing a practical, training-free mechanism to identify when the model may not know the answer. Overall, neighborhood consistency offers a scalable path toward robust, selective VQA in real-world, black-box scenarios.

Abstract

The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model, or study only unimodal models. However, the most powerful models (e.g., GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that given only a visual question and model response, the consistency of the model's responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.
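The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `black_box` and `rephrase` are hypothetical callables standing in for the API-only VLM and the smaller proxy rephrasing model, and exact string match after normalization is an assumed agreement criterion that the paper may define differently.

```python
from typing import Callable, List, Optional, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def neighborhood_consistency(
    image: object,
    question: str,
    black_box: Callable[[object, str], str],            # hypothetical: VLM API call
    rephrase: Callable[[object, str, int], List[str]],  # hypothetical: proxy rephraser
    n_rephrasings: int = 5,
) -> Tuple[str, float]:
    """Return the original answer and the fraction of rephrasings that agree with it."""
    original = normalize(black_box(image, question))
    neighbors = rephrase(image, question, n_rephrasings)
    if not neighbors:
        # No neighbors were sampled, so there is no evidence of consistency.
        return original, 0.0
    agree = sum(normalize(black_box(image, q)) == original for q in neighbors)
    return original, agree / len(neighbors)


def answer_or_abstain(
    answer: str, consistency: float, threshold: float = 0.6
) -> Optional[str]:
    """Return the answer if it is consistent enough, otherwise None (abstain)."""
    return answer if consistency >= threshold else None
```

The threshold of 0.6 is an arbitrary placeholder; in practice it would be chosen on held-out data to reach a target risk or coverage.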

Paper Structure

This paper contains 23 sections, 2 equations, 12 figures, 9 tables, 1 algorithm.

Figures (12)

  • Figure 1: Identifying unreliable responses from an API-only black-box vision-language model (VLM) can be challenging because confidence scores are not always trustworthy, and more sophisticated methods for selective prediction require a level of access to the model that is unavailable. We explore the idea of model consistency to identify unreliable model responses in this realistic scenario: a reliable response is one that is consistent across questions that are semantically equivalent but different on the surface.
  • Figure 2: For out-of-distribution (OKVQA) and adversarial (AdVQA) visual questions, confidence scores alone do not work well to separate right from wrong answers: many correct answers are low confidence for OOD data, and many wrong answers are high confidence for adversarial data. Note: Displayed confidence scores are raw. See Appendix for discussion on calibration.
  • Figure 3: Selective VQA performance of a VLM (BLIP) on three datasets: adversarial (AdVQA), out-of-distribution (OKVQA), and in-distribution (VQAv2). On OOD and adversarial questions, the model has a harder time identifying which questions it should abstain from.
  • Figure 4: Examples showing the use of model-generated rephrasings to identify errors in model predictions with BLIP as the black-box model $f_{BB}$. In the left panel, we show high-confidence answers that are wrong and are identified by their low consistency across rephrasings. In the right panel, we show low-confidence answers that are actually correct, identified by their high consistency across rephrasings.
  • Figure 5: The distribution of confidence scores of $f_{BB}$ at each level of consistency. While higher levels of consistency have a larger proportion of high confidence answers, they also retain a large number of low confidence answers, showing that consistency defines a different ordering over questions than confidence scores alone. BLIP is used as the black-box model $f_{BB}$.
  • ...and 7 more figures
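
The risk-coverage trade-off behind the selective VQA results (e.g., Figure 3) can be made concrete with a short sketch. This is an illustrative computation under stated assumptions, not the paper's evaluation code: `scores` may be either raw confidence or a consistency score, `correct` is assumed to be a binary correctness label per question (VQA benchmarks often use a soft accuracy instead), and ties in `scores` are broken by input order.

```python
from typing import List, Sequence, Tuple


def risk_coverage_curve(
    scores: Sequence[float], correct: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (coverage, risk) pairs from answering only the top-k scored questions."""
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    # Answer the questions the selector trusts most first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve: List[Tuple[float, float]] = []
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        coverage = k / len(scores)  # fraction of questions answered
        risk = errors / k           # error rate among answered questions
        curve.append((coverage, risk))
    return curve
```

A better selection score pushes errors toward the low-score end of the ordering, so risk stays low over a wider range of coverage.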