T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis

Raza Imam, Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub

TL;DR

This work addresses the challenge of fusing pretrained generalist medical vision-language models (MVLMs) with domain-specific expert models under distribution shifts in medical imaging. It introduces T^3, a backpropagation-free, mutual-information-guided test-time merging framework that computes per-sample interpolation weights using the Jensen-Shannon divergence between the two models' output distributions, and extends this to a batch-wise variant, $\mathbb{T^3}_{\mathcal{B}}$, for efficiency. The method achieves state-of-the-art or competitive Top-1 accuracy and reduced corruption errors across four medical modalities, demonstrating robust out-of-distribution (OOD) performance while keeping inference costs practical via batching and optional coefficient precomputation. The results highlight the value of explicitly modeling model consensus and disagreement for reliable adaptive fusion in clinical MVLM deployment.

Abstract

In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference cost of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a single merging coefficient across a batch of samples, substantially reducing the computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruption settings across four modalities. Empirically, T^3 sets a new state of the art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
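
To make the central quantity concrete, the following is a minimal sketch of the Jensen-Shannon divergence between two models' class-probability outputs, plus one possible batch-level aggregate. The function names (`softmax`, `kl_divergence`, `js_divergence`, `batch_js`) and the toy logits are illustrative assumptions, not taken from the authors' repository. Averaging over the batch is also an assumption: the abstract only says that T^3_B computes a coefficient across a batch. How the divergence is mapped to a merging coefficient is not shown here.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability vector (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)

def batch_js(pretrained_logits, expert_logits):
    """ASSUMPTION: aggregate a batch by the mean per-sample divergence."""
    values = [
        js_divergence(softmax(a), softmax(b))
        for a, b in zip(pretrained_logits, expert_logits)
    ]
    return sum(values) / len(values)

# Toy logits for two samples and three classes (made up for illustration).
pretrained_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
expert_logits = [[2.1, 0.4, 0.0], [3.0, 0.1, 0.2]]
per_sample = [
    js_divergence(softmax(a), softmax(b))
    for a, b in zip(pretrained_logits, expert_logits)
]
batch_value = batch_js(pretrained_logits, expert_logits)
```

With natural logarithms the divergence is symmetric, equals zero for identical distributions, and is bounded above by ln 2 (about 0.693), which is reached when the two distributions have disjoint support. In the toy batch, the first sample (near agreement) yields a much smaller value than the second (the models pick different classes).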

Paper Structure

This paper contains 23 sections, 13 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Histograms of the interpolation coefficients induced by the X-entropy ratio $X(x)$ (Eq. \ref{eq:XER}) between the pretrained and expert models, shown for each modality under three test settings: In-Domain (data seen by the expert during fine-tuning), Base-to-Novel (cross-dataset generalization), and Corruption inputs. The $X(x)$ coefficient estimates vary greatly in symmetry and skewness, depending on the data modality and the OOD shift. For instance, in Fundoscopy, $X(x)$ remains tightly clustered on the In-Domain test set but shows strong variation under Base-to-Novel inputs, indicating reduced reliance on the fine-tuned expert.
  • Figure 2: Pearson correlation $\rho$ between the mutual information $I(x)$ (Eq. \ref{eq:mutual_information1}) and the entropy ratio $R(x)$ (Eq. \ref{eq:ER}). We partition each test set into four groups (TrueTrue, TrueFalse, FalseTrue, and FalseFalse) according to whether the pretrained and expert models make correct or incorrect predictions. For each group, we show a scatter plot of the entropy ratio $R(x)$ on the x-axis against the mutual information $I(x)$, together with the Pearson correlation $\rho$. The top row shows the Cell Microscopy PBC dataset (from MediMeta) and the bottom row shows the Breast Imaging Mammo dataset (from MediMeta), both with a CLIP ViT-B/16 backbone. $I(x)$ correlates strongly with $R(x)$ across all groups, suggesting that it is a strong alternative interpolation coefficient that could also capture joint predictive confidence better than entropy.
  • Figure 3: Decision-quadrant analysis of consensus vs. disagreement via combined confidence and JS divergence. Here M refers to $\bar{p}(x)$ as in Eq. \ref{eq:mutual_information1}. Combined confidence alone treats high-confidence OOD samples uniformly and so fails to separate agreement from disagreement, whereas JS divergence cleanly isolates high-confidence disagreements, highlighting its superiority as a proxy for joint predictive certainty in model-merging scenarios across diverse modalities.
  • Figure 4: $\mathbb{T^3}$ Test-Time Task Adaptive Merging workflow. For each input $x$, the pretrained CLIP model and the domain expert model each generate an output distribution, and the two are compared using the Jensen-Shannon divergence to quantify their agreement. This divergence is transformed into an interpolation coefficient $\lambda(x)$ through a sigmoid function, which determines the parameter blending for each test sample. Higher disagreement (larger JS divergence) increases the expert model's influence, while agreement favors the pretrained model, enabling adaptive merging that optimizes both accuracy and robustness across distribution shifts.
  • Figure 5: Cross-dataset evaluation benchmark, depicting the in-domain and cross-domain setup for model merging in medical imaging. It illustrates four test conditions for each of the four imaging modalities: (i) in-domain MedMNIST \cite{chen2021medmnist}, (ii) novel-class samples from MediMeta \cite{woerner2024comprehensiveeasytousemultidomainmultitask}, (iii) noise corruptions (MedMNIST-C \cite{di2024medmnist}), and (iv) pixelation corruptions (MedMNIST-C \cite{di2024medmnist}). See Appendix \ref{sec:datasets_and_experiments} for details.
  • ...and 5 more figures
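
The workflow in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It follows the caption as written, so a larger JS divergence shifts weight toward the expert. The `scale` and `shift` constants, the function names, and the dictionary-of-lists parameter format are all hypothetical, chosen only so the example runs without external libraries.

```python
import math

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def expert_weight(js, scale=10.0, shift=0.35):
    """Map a JS divergence to a coefficient in (0, 1).

    ASSUMPTION: scale and shift are hypothetical; the caption only says
    that the divergence passes through a sigmoid. Following the caption,
    a larger divergence yields a larger weight on the expert.
    """
    return sigmoid(scale * (js - shift))

def merge_parameters(pretrained, expert, lam):
    """Linearly interpolate two parameter sets; lam weights the expert.

    Both arguments map parameter names to flat lists of floats
    (a stand-in for real model state, used here for illustration).
    """
    return {
        name: [(1.0 - lam) * a + lam * b
               for a, b in zip(pretrained[name], expert[name])]
        for name in pretrained
    }

# Toy parameters for a single layer (made up for illustration).
pretrained = {"layer.weight": [1.0, 3.0]}
expert = {"layer.weight": [3.0, 5.0]}

lam_agree = expert_weight(0.01)     # models nearly agree
lam_disagree = expert_weight(0.60)  # models strongly disagree
merged = merge_parameters(pretrained, expert, lam_disagree)
```

Because the interpolation happens in parameter space, a per-sample coefficient implies one merged model per input, which is the inference cost that the batch-wise variant is meant to reduce by sharing a coefficient across a batch.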