Does Bigger Mean Better? Comparative Analysis of CNNs and Biomedical Vision Language Models in Medical Diagnosis
Ran Tong, Jiaqi Liu, Tong Wang, Xin Hu, Su Liu, Lanruo Wang, Jiexi Xu
TL;DR
This work compares a lightweight supervised CNN baseline with a zero-shot medical Vision-Language Model (BiomedCLIP) for chest X-ray diagnosis, covering pneumonia and tuberculosis (TB) detection. The authors find that BiomedCLIP with its default zero-shot decision rule underperforms the CNN, but calibrating the decision threshold on a validation set improves it substantially: BiomedCLIP then edges past the CNN on pneumonia (F1 0.8841 vs. 0.8803) and comes close to it on TB (F1 0.7684 vs. 0.7834). Key contributions include quantifying the effect of calibration, reporting detailed F1 and ROC AUC results, and highlighting the computation-accuracy trade-off between a tiny CNN and large VLMs. The work underscores the importance of post-processing calibration when deploying large pre-trained models in clinical imaging and suggests paths for few-shot and federated extensions.
Abstract
The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. Optimizing the classification threshold on a validation set substantially boosts the performance of BiomedCLIP on both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve an F1-score of 0.8841, surpassing the supervised CNN's 0.8803. For tuberculosis detection, calibration improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline's 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to approach or even outperform efficient, task-specific supervised models.
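To make the calibration step concrete, the following is a minimal sketch of choosing a decision threshold that maximizes F1 on a validation set. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe the exact search procedure, so the exhaustive sweep over observed scores is assumed, as is the idea that the VLM produces one positive-class score per image (for example, a softmax over prompt similarities). The function names `f1_score_binary` and `calibrate_threshold` and the toy data are hypothetical.

```python
import numpy as np


def f1_score_binary(labels, preds):
    """F1 for binary labels/predictions (1 = disease, 0 = normal)."""
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def calibrate_threshold(val_scores, val_labels):
    """Return (threshold, F1) maximizing F1 on the validation set.

    Assumption: val_scores are positive-class scores from the zero-shot
    model; every distinct observed score is tried as a cut-off.
    """
    val_scores = np.asarray(val_scores, dtype=float)
    val_labels = np.asarray(val_labels, dtype=int)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(val_scores):  # sorted ascending
        preds = (val_scores >= t).astype(int)
        f1 = f1_score_binary(val_labels, preds)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = float(t), f1
    return best_t, best_f1


# Toy validation data (fabricated for illustration only): the positive
# scores all fall below 0.5, so the default cut-off predicts no disease.
toy_scores = np.array([0.10, 0.15, 0.20, 0.30, 0.35, 0.40])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

default_f1 = f1_score_binary(toy_labels, (toy_scores >= 0.5).astype(int))
threshold, calibrated_f1 = calibrate_threshold(toy_scores, toy_labels)
```

The threshold is selected on validation data only and then applied unchanged to the test set, so the model weights stay frozen and the procedure adds no training cost. On skewed score distributions like the toy one, the default cut-off of 0.5 can miss every positive case even when the scores rank the images well, which is consistent with the gap between default and calibrated F1 reported above.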
