T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis
Raza Imam, Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub
TL;DR
This work addresses the challenge of fusing pretrained generalist medical vision-language models (MVLMs) with domain-specific expert models under distribution shifts in medical imaging. It introduces T^3, a backpropagation-free, mutual-information-guided test-time merging framework that computes per-sample interpolation weights from the Jensen-Shannon divergence between the two models' output distributions, and extends this to a batch-wise variant, T^3_B, for efficiency. The method achieves state-of-the-art or competitive Top-1 accuracy and lower corruption error across four medical modalities, demonstrating robust out-of-distribution (OOD) performance while keeping inference costs practical through batching and optional coefficient precomputation. The results highlight the value of explicitly modeling consensus and disagreement between models for reliable adaptive fusion in clinical MVLM deployment.
Abstract
In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when the models agree and defers to generalist robustness under drift. To reduce the inference cost of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a single merging coefficient across a batch of samples, substantially reducing the computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruption settings across four modalities. Empirically, T^3 sets a new state of the art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
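The divergence-driven interpolation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the normalization of the Jensen-Shannon divergence by ln 2, and the choice to use the normalized value directly as the weight on the pretrained model are all assumptions made for illustration, as is averaging per-sample coefficients for the batch-wise variant. The abstract states only that the coefficient is derived from the divergence, that agreement preserves expert precision, and that disagreement defers to the generalist.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; bounded in [0, ln 2]."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def merge_coefficient(logits_pt, logits_ft):
    """Hypothetical per-sample weight on the pretrained (generalist) model.

    Assumption: JS / ln 2 is used directly, so agreement gives a weight near 0
    (keep the expert) and strong disagreement gives a weight near 1.
    """
    js = js_divergence(softmax(logits_pt), softmax(logits_ft))
    return min(1.0, max(0.0, js / math.log(2)))

def batch_coefficient(batch_logits_pt, batch_logits_ft):
    """Hypothetical batch-wise variant: one shared coefficient per batch,
    taken here as the mean of the per-sample coefficients (an assumption)."""
    coeffs = [merge_coefficient(a, b)
              for a, b in zip(batch_logits_pt, batch_logits_ft)]
    return sum(coeffs) / len(coeffs)

def interpolate(theta_pt, theta_ft, lam):
    """Linear interpolation of two flat parameter lists with weight lam on theta_pt."""
    return [lam * a + (1 - lam) * b for a, b in zip(theta_pt, theta_ft)]
```

Under these assumptions, no gradients are required: each coefficient needs only one forward pass through each model, and the batch-wise variant performs a single parameter interpolation per batch rather than one per sample.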
