Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations

Sihao Ding, Santosh Vasa, Aditi Ramadwar

TL;DR

This work tackles the gap between plausibility and faithfulness in vision-language model explanations by introducing Explanation-Driven Counterfactual Testing (EDCT). EDCT treats the model’s own NLE as a falsifiable hypothesis, extracting visual concepts, generating targeted counterfactual edits with diffusion-based tools, and scoring faithfulness via a three-part metric: Prediction Change Score ($PCS$), NLE Concept Consistency ($NCC$), and Counterfactual Consistency Score ($CCS$), where $CCS_i = PCS_i \cdot NCC_i$ and $CCS = \frac{1}{k}\sum_i CCS_i$. Evaluated on 120 OK-VQA examples across multiple VLMs, EDCT reveals substantial faithfulness gaps, with Gemini 2.5 Flash achieving the highest scores and ablation showing concept extraction and judge LLM quality as the main performance drivers. The approach supports regulator-aligned auditing by delivering traceable prompts, seeds, masks, and rationales, and points to future extensions to more complex modalities and improved edit controls.
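As an illustration of how the three scores combine, the sketch below computes $CCS_i = PCS_i \cdot NCC_i$ per concept and averages over the $k$ tested concepts. It covers only the aggregation step: how $PCS_i$ and $NCC_i$ are themselves obtained is not specified here, and the `ConceptScore` type, the function names, the example values, and the assumption that both scores lie in $[0, 1]$ are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptScore:
    """Scores for one counterfactual edit (hypothetical container type)."""
    concept: str
    pcs: float  # Prediction Change Score; assumed here to lie in [0, 1]
    ncc: float  # NLE Concept Consistency; assumed here to lie in [0, 1]


def ccs_per_concept(score: ConceptScore) -> float:
    """CCS_i = PCS_i * NCC_i for a single concept."""
    return score.pcs * score.ncc


def ccs(scores: list[ConceptScore]) -> float:
    """CCS = (1/k) * sum_i CCS_i over the k tested concepts."""
    if not scores:
        raise ValueError("CCS is undefined when no concepts were tested")
    return sum(ccs_per_concept(s) for s in scores) / len(scores)


# Made-up values for illustration only; these are not results from the paper.
example = [
    ConceptScore("umbrella", pcs=1.0, ncc=1.0),  # CCS_i = 1.0
    ConceptScore("wet road", pcs=0.0, ncc=1.0),  # CCS_i = 0.0
    ConceptScore("raincoat", pcs=1.0, ncc=0.5),  # CCS_i = 0.5
]
overall = ccs(example)
```

Because the per-concept score is a product, a concept contributes to the total only when both factors are nonzero: in the second row the explanation score is perfect, yet the contribution is zero because the prediction score is zero.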

Abstract

Vision-Language Models (VLMs) often produce fluent Natural Language Explanations (NLEs) that sound convincing but may not reflect the causal factors driving predictions. This mismatch of plausibility and faithfulness poses technical and governance risks. We introduce Explanation-Driven Counterfactual Testing (EDCT), a fully automated verification procedure for a target VLM that treats the model's own explanation as a falsifiable hypothesis. Given an image-question pair, EDCT: (1) obtains the model's answer and NLE, (2) parses the NLE into testable visual concepts, (3) generates targeted counterfactual edits via generative inpainting, and (4) computes a Counterfactual Consistency Score (CCS) using LLM-assisted analysis of changes in both answers and explanations. Across 120 curated OK-VQA examples and multiple VLMs, EDCT uncovers substantial faithfulness gaps and provides regulator-aligned audit artifacts indicating when cited concepts fail causal tests.
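The four steps can be read as a simple loop over the concepts extracted from the explanation. The following is a minimal sketch of that control flow only, not the authors' implementation: the target VLM, the concept parser, the inpainting editor, and the LLM judge are passed in as callables, and their names, signatures, and return shapes (including the judge returning a PCS and NCC pair) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

Image = Any  # placeholder; a real pipeline would use a concrete image type


@dataclass
class ConceptTrial:
    """Outcome of one counterfactual edit (hypothetical record type)."""
    concept: str
    pcs: float
    ncc: float

    @property
    def ccs(self) -> float:
        return self.pcs * self.ncc


def run_edct(
    image: Image,
    question: str,
    vlm: Callable[[Image, str], tuple[str, str]],
    extract_concepts: Callable[[str], list[str]],
    edit_image: Callable[[Image, str], Image],
    judge: Callable[[str, str, str, str, str], tuple[float, float]],
) -> tuple[float, list[ConceptTrial]]:
    # Step 1: obtain the model's answer and its explanation (NLE).
    answer, nle = vlm(image, question)
    # Step 2: parse the NLE into testable visual concepts.
    concepts = extract_concepts(nle)
    trials: list[ConceptTrial] = []
    for concept in concepts:
        # Step 3: generate a targeted counterfactual edit for this concept.
        cf_image = edit_image(image, concept)
        cf_answer, cf_nle = vlm(cf_image, question)
        # Step 4: the judge compares answers and explanations before and after.
        pcs, ncc = judge(concept, answer, nle, cf_answer, cf_nle)
        trials.append(ConceptTrial(concept, pcs, ncc))
    if not trials:
        raise ValueError("no testable concepts were extracted from the NLE")
    overall = sum(t.ccs for t in trials) / len(trials)
    return overall, trials
```

Keeping the components as injected callables means the same loop can be run against different VLMs or judges, and the returned list of trials retains a per-concept record alongside the aggregate score.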

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Counterfactual generation process for Explanation-Driven Counterfactual Testing.
  • Figure 2: