
The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Doan Nam Long Vu, Simone Balloccu

Abstract

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, \textsc{FOR2107} (affective disorders) and \textsc{OASIS-3} (cognitive decline). Both datasets include structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58\% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely \emph{mentioning} MRI availability in the task prompt accounts for 70--80\% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the \emph{scaffold effect}. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward the random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.


Paper Structure

This paper contains 56 sections, 12 equations, 10 figures, and 24 tables.

Figures (10)

  • Figure 1: Overview of the proposed VLM pipeline. Between FOR2107 and OASIS-3, only the labels change: MDD becomes Cognitive Decline, and Control becomes Cognitive Normal.
  • Figure 2: F1 score on the two datasets, OASIS-3 and FOR2107, across five different modes.
  • Figure 3: Group mean $\hat{P}(\text{MDD})$ across the three input conditions. The black line and markers show the group mean per condition. The shaded band indicates $\pm 1$ STD. The dashed line marks the decision boundary at $0.5$.
  • Figure 4: Cosine similarity to the scaffold direction vs. $\delta^{\textsc{text(arcf)}+\textsc{prompt}}_{\leftarrow\,\textsc{text(arcf)}}$ for candidate phrases across semantic categories, evaluated on Qwen2.5-VL-3B over the FOR2107 cohort. Phrases in the top-right quadrant activate the same residual-stream direction as text(arcf)$+$prompt(mri) without providing any imaging data.
  • Figure 5: The text(arcf) prompt for FOR2107.
  • ...and 5 more figures
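
The analysis in Figure 4 compares each candidate phrase's effect on the model's residual stream against the "scaffold direction" (the activation shift induced by adding prompt(mri) to text(arcf)). A minimal sketch of that comparison, with toy stand-in vectors rather than real model activations (the variable names and values here are illustrative assumptions, not the paper's actual data):

```python
def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# Scaffold direction: mean activation under text(arcf)+prompt(mri) minus
# mean activation under text(arcf) alone (toy values, not real activations).
scaffold_dir = [0.9, 0.1, -0.3]

# A candidate phrase's activation delta relative to the text(arcf) baseline.
candidate_delta = [0.8, 0.2, -0.2]

# High similarity would place the phrase toward Figure 4's top-right quadrant:
# it activates the same direction without any imaging data being provided.
sim = cosine_similarity(scaffold_dir, candidate_delta)
```

In the actual analysis these vectors would be residual-stream activations from Qwen2.5-VL-3B; the sketch only illustrates the geometric comparison itself.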