Visual concept ranking uncovers medical shortcuts used by large multimodal models

Joseph D. Janizek, Sonnet Xu, Junayd Lateef, Roxana Daneshjou

TL;DR

The paper proposes Visual Concept Ranking (VCR), a causal-audit approach for large multimodal models (LMMs). VCR uses a vision–language model to label a probe image set with concept scores, learns Concept Activation Vectors from LMM activations, and ranks concepts by the directional-derivative sensitivity of the task score to each concept. The authors validate VCR on synthetic benchmarks, reporting robustness to distribution shift and a strong link between VCR sensitivity and true interventional effects. They then apply VCR to malignant skin-lesion classification, where it reveals visual shortcuts (e.g., blue/purple ink markings) and background-related biases that affect predictions; these are confirmed via manual interventions. The study also demonstrates VCR's applicability beyond dermatology (CheXpert, Imagenette), discusses limitations such as the need for gradient access and noise in semantic labels, and outlines future work such as combining VCR with activation steering and improving spatial concept encoding. Overall, VCR is presented as a causal, scalable, and interpretable framework for auditing LMMs in safety-critical domains, supporting hypothesis generation and targeted interventions to improve reliability and fairness.
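
The pipeline summarized above (concept scores, concept activation vectors, directional-derivative ranking) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the activations are random, the concept scores are simulated rather than produced by a vision–language model, the concept direction is fit by ridge regression, the task score is a toy logistic head, and the concept names are hypothetical. It is not the authors' implementation, which this page does not specify.

```python
import numpy as np


def fit_cav(activations, concept_scores, ridge=1e-3):
    """Fit a unit-norm concept direction by ridge regression of scores on activations."""
    a = activations - activations.mean(axis=0)
    s = concept_scores - concept_scores.mean()
    d = a.shape[1]
    w = np.linalg.solve(a.T @ a + ridge * np.eye(d), a.T @ s)
    return w / np.linalg.norm(w)


def task_score_grad(activations, head):
    """Gradient of the toy task score sigmoid(a @ head) w.r.t. each activation row."""
    p = 1.0 / (1.0 + np.exp(-(activations @ head)))
    return (p * (1.0 - p))[:, None] * head[None, :]


def concept_sensitivity(activations, cav, head):
    """Mean directional derivative of the task score along the concept direction."""
    return float(np.mean(task_score_grad(activations, head) @ cav))


rng = np.random.default_rng(0)
n, d = 500, 16
acts = rng.normal(size=(n, d))  # stand-in for LMM activations on a probe set

# Hypothetical concepts, each tied to one axis of the toy activation space.
true_dirs = {"ink_marking": np.eye(d)[0], "ruler": np.eye(d)[1], "hair": np.eye(d)[2]}
# Simulated concept scores (a real pipeline would get these from a labelling model).
scores = {name: acts @ v + 0.1 * rng.normal(size=n) for name, v in true_dirs.items()}

# Toy task head: relies strongly on axis 0, weakly and negatively on axis 1.
head = np.zeros(d)
head[0] = 2.0
head[1] = -0.5

# Rank concepts by the magnitude of their sensitivity, largest first.
ranking = sorted(
    ((name, concept_sensitivity(acts, fit_cav(acts, s), head)) for name, s in scores.items()),
    key=lambda kv: abs(kv[1]),
    reverse=True,
)
for name, value in ranking:
    print(f"{name}: {value:+.4f}")
```

In this toy setup the concept aligned with the head's largest weight ranks first with a positive sensitivity, and the concept the head ignores ranks last. A real audit would also need a significance test per concept, which this sketch omits.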

Abstract

Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. We introduce a method for identifying important visual concepts within large multimodal models (LMMs) and use it to investigate the behaviors these models exhibit when prompted with medical tasks. We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images, with supplemental experiments including both chest radiographs and natural images. After showing how LMMs display unexpected gaps in performance between different demographic subgroups when prompted with demonstrating examples, we apply our method, Visual Concept Ranking (VCR), to these models and prompts. VCR generates hypotheses related to different visual feature dependencies, which we are then able to validate with manual interventions.

Paper Structure (58 sections, 8 equations, 27 figures, 1 table)

Figures (27)

  • Figure 1: Our visual concept ranking method (VCR) identifies visual concepts that have a statistically significant impact on an LMM's output for a particular task or prompt. This concept figure illustrates a hypothetical binary classification prompt for a set of clinical dermatology images, tested against tens of thousands of visual concepts.
  • Figure 2: Left, Example images from the synthetic datasets. Right, For two synthetic feature pairs (Square/Circle, and Empty/Filled), the relationship between VCR sensitivity score and measured interventional effect. Each point represents an OpenFlamingo-4B model fine-tuned on one of ten bootstrap replicates of one of five different training sets with different feature-label correlation levels.
  • Figure 3: Summary of VCR-intervention correlations. For each feature pair tested in an experiment with fine-tuned OpenFlamingo-4B models, except for the two feature pairs related to position within the image, the overall correlation between VCR sensitivity and ground-truth measured interventional effect had Pearson's $r > 0.6$. Neither position-related feature pair was significantly positively correlated.
  • Figure 4: VCR correctly identifies the direction of the causal effect of a spuriously correlated visual feature 92% of the time, compared with a CLIP-only approach that does not use model internals (akin to MA-MONET) and identifies the correct direction only 18% of the time. Each point represents an OpenFlamingo-4B model fine-tuned on one of ten bootstrap replicates of one of six different feature-pair synthetic datasets.
  • Figure 5: Predictive performance of OpenFlamingo-3B-Instruct for skin lesion malignancy classification across skin type subgroups as additional demonstrating examples are added in-context. Points represent the mean and shading represents one standard deviation over 3 replicates (where the specific demonstrating examples are re-sampled across replicates).
  • ...and 22 more figures
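
The two summary statistics named in the captions of Figures 3 and 4 (Pearson's $r$ between sensitivity and measured interventional effect, and the rate at which the predicted direction of an effect is correct) can be computed as in the sketch below. The data are synthetic and generated only to exercise the functions; they are not the paper's results, and the function names are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def sign_agreement(predicted, measured):
    """Fraction of entries where the predicted and measured effects share a sign."""
    return float(np.mean(np.sign(predicted) == np.sign(measured)))


# Synthetic stand-ins: one value per fine-tuned model (not data from the paper).
rng = np.random.default_rng(0)
effect = rng.normal(size=50)                             # "measured" interventional effect
sensitivity = 0.8 * effect + 0.3 * rng.normal(size=50)   # noisy proxy for that effect

r = pearson_r(sensitivity, effect)
agree = sign_agreement(sensitivity, effect)
print(f"Pearson r = {r:.3f}, direction agreement = {agree:.2%}")
```

Correlation captures how well sensitivity tracks the size of an effect, while sign agreement only asks whether the direction is right, so the two can diverge when effects are small.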