Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

Junzhang Liu; Zhecan Wang; Hammad Ayyubi; Haoxuan You; Chris Thomas; Rui Sun; Shih-Fu Chang; Kai-Wei Chang

TL;DR

The paper tackles the problem of insufficient event-specific context in Vision-Language Understanding (VLU) benchmarks, showing that many samples induce baseless predictions. It introduces a model-agnostic Context Selection Module to incorporate contextual evidence when available, and CARA, a multimodal abstention detector, to refrain from answering when context is lacking. By collecting contextual data and training a probabilistic context selector, the approach yields consistent gains across VLU benchmarks, and CARA generalizes to benchmarks it was not trained on. The authors also curate the CASE set for evaluating insufficient-context detectors. Overall, the work offers a practical framework for cleaning VLU benchmarks and for real-world deployments where context may be incomplete.

Abstract

Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context. Training models on such data fosters biased learning and hallucinations, as models tend to make similar unwarranted assumptions. To address this issue, we collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. Strong improvements across multiple benchmarks demonstrate the effectiveness of our approach. Further, we develop a general-purpose Context-AwaRe Abstention (CARA) detector to identify samples lacking sufficient context and enhance model accuracy by abstaining from responding if the required context is absent. CARA generalizes to new benchmarks it wasn't trained on, underscoring its utility for future VLU benchmarks in detecting or cleaning samples with inadequate context. Finally, we curate a Context Ambiguity and Sufficiency Evaluation (CASE) set to benchmark the performance of insufficient context detectors. Overall, our work represents a significant advancement in ensuring that vision-language models generate trustworthy and evidence-based outputs in complex real-world scenarios.

Paper Structure (54 sections, 2 equations, 12 figures, 8 tables)

Introduction
Related Work
Unanswerable Visual Questions
VLU with External Resources
Abstention in Multimodal Systems
Problem Space
Contextual Data Collection
Data Fetching
Context Retrieval and Filtering
Context Retrieval
Context Filtering
Data Statistics
Training Data
Evaluation Data
Method
...and 39 more sections

Figures (12)

Figure 1: Examples of samples with insufficient context to answer the question across several representative Vision Language Understanding (VLU) benchmarks. "Q" represents the question, "A" stands for the answer, and "P" denotes a typical Vision Language Model (VLM; here BLIP-2) prediction. We find that samples with insufficient context are common across several VLU benchmarks, causing VLMs to hallucinate predictions. Using (wearing) our proposed CARA (hat), VLMs are able to abstain from responding instead of making baseless predictions in such cases.
Figure 2: Illustration of how we obtain contextual data for VCR, Visual SWAG, and VisualCOMET. The video from which the image sample is sourced is identified to obtain temporal context in the form of frames and captions around the image sample in question. The context provides the necessary evidence required to answer these highly semantic questions.
Figure 3: A high-level demonstration of the probabilistic context selection method. For the VLM's input, in addition to the question and image, a context sentence selected by the Context Selection Module is appended to the original input.
Figure 4: Top: We use models with/without context to pseudo-label whether instances need context. The labeled data is then used to train CARA. Bottom: CARA decides whether to abstain based on whether the input contains sufficient context.
Figure 5: (a) Ablation study for context window size. VL-BERT accuracy (y-axis) is plotted against context window size (x-axis) for Visual SWAG (dotted) and VCR (solid). Accuracy peaks at a window size of 3, indicating the optimal size. (b) Ablation study for context selection number. The performance of multiple models (in different colors) is plotted on the y-axis against the number of selected contexts on the x-axis for multiple datasets. We allow the VLM to observe multiple contexts within the window size. The peak in the curve indicates that selecting 2 out of 3 contexts results in the best performance.
...and 7 more figures
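
The captions for Figures 3 and 4 describe three steps: appending a selected context sentence to the VLM input, pseudo-labeling samples by comparing models with and without context, and abstaining when context is insufficient. The Python sketch below illustrates that flow under stated assumptions; it is not the authors' implementation. The `Sample` fields, the `[CONTEXT]` separator, the pseudo-labeling rule, the `ABSTAIN` token, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    question: str
    correct_without_context: bool  # hypothetical: base VLM answered correctly
    correct_with_context: bool     # hypothetical: context-augmented VLM answered correctly


def build_input(question: str, context: Optional[str]) -> str:
    """Append one selected context sentence to the question (cf. Figure 3).

    The "[CONTEXT]" separator is an assumption, not taken from the paper.
    """
    if context is None:
        return question
    return f"{question} [CONTEXT] {context}"


def pseudo_label_needs_context(sample: Sample) -> bool:
    """Assumed labeling rule (cf. Figure 4, top): a sample needs context if
    only the context-augmented model answers it correctly."""
    return sample.correct_with_context and not sample.correct_without_context


def answer_or_abstain(insufficiency_score: float, prediction: str,
                      threshold: float = 0.5) -> str:
    """Abstain when a detector score signals insufficient context
    (cf. Figure 4, bottom). Score range and threshold are assumptions."""
    if insufficiency_score >= threshold:
        return "ABSTAIN"
    return prediction
```

In this sketch, the pseudo-labels would serve as training targets for a detector, and `insufficiency_score` stands in for that detector's output at inference time.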
