Does Bigger Mean Better? Comparative Analysis of CNNs and Biomedical Vision Language Models in Medical Diagnosis

Ran Tong, Jiaqi Liu, Tong Wang, Xin Hu, Su Liu, Lanruo Wang, Jiexi Xu

TL;DR

This work compares a lightweight supervised CNN baseline with a zero-shot medical Vision-Language Model (BiomedCLIP) for chest X-ray diagnosis, covering pneumonia and tuberculosis (TB) detection. With its default decision threshold, zero-shot BiomedCLIP underperforms the CNN. Calibrating the threshold on a validation set closes the gap: BiomedCLIP reaches an F1-score of 0.8841 on pneumonia, ahead of the CNN's 0.8803, and rises from 0.4812 to 0.7684 on TB, close to the CNN's 0.7834. Key contributions include quantifying the effect of calibration, reporting detailed F1 and ROC AUC results, and highlighting the computation-accuracy trade-off between a tiny CNN and large VLMs. The work underscores the importance of post-processing calibration when deploying large pre-trained models in clinical imaging and suggests paths for few-shot and federated extensions.

Abstract

The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN's 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline's 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.
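
The threshold calibration described in the abstract can be illustrated with a minimal sketch. The abstract states only that the classification threshold is optimized on a validation set, so everything else here is an assumption made for illustration: the grid search over 101 evenly spaced thresholds, the use of F1 as the selection criterion, the function names `f1_score_binary` and `calibrate_threshold`, and the toy scores.

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """F1-score for the positive class (label 1) of a binary problem."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def calibrate_threshold(val_probs, val_labels, grid=None):
    """Pick the threshold that maximizes F1 on validation data.

    Hypothetical helper: the grid search is an assumption, not
    necessarily the procedure used in the paper.
    """
    val_probs = np.asarray(val_probs, dtype=float)
    val_labels = np.asarray(val_labels, dtype=int)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = (val_probs >= t).astype(int)
        f1 = f1_score_binary(val_labels, preds)
        if f1 > best_f1:  # keep the first (lowest) threshold on ties
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Toy validation scores (made up): positives score higher than negatives,
# but every score sits below the default threshold of 0.5.
val_probs = np.array([0.12, 0.17, 0.22, 0.31, 0.36, 0.43])
val_labels = np.array([0, 0, 0, 1, 1, 1])

default_f1 = f1_score_binary(val_labels, (val_probs >= 0.5).astype(int))
best_t, best_f1 = calibrate_threshold(val_probs, val_labels)
print(f"default F1 = {default_f1:.2f}, calibrated F1 = {best_f1:.2f} at t = {best_t:.2f}")
```

In this toy case the scores rank the classes perfectly yet the default threshold predicts no positives at all, which is the kind of gap that calibration closes. The selected threshold would then be applied unchanged to the test set, so that test labels play no part in choosing it.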

Paper Structure

This paper contains 20 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Sample test images from PneumoniaMNIST with predicted probabilities for the pneumonia class (p1) from the trained CNN and BiomedCLIP (BioCLIP). The top row shows normal cases (y=0) and the bottom row shows pneumonia cases (y=1). All probability color maps use a common 0 to 1 legend for visual comparability.
  • Figure 2: Grad-CAM visualizations for the CNN trained on PneumoniaMNIST. The heatmaps (bottom row) highlight the image regions most influential for predicting pneumonia. Red indicates high importance. The model correctly focuses on lung fields in pneumonia cases (right). This alignment between Grad-CAM heat maps and lung fields supports the clinical plausibility of CNN predictions.
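
For readers unfamiliar with the heatmaps in Figure 2, the core Grad-CAM computation can be sketched in a few lines of NumPy. This is the generic formulation, not the authors' code: it assumes the last convolutional layer's activations and the gradients of the pneumonia score with respect to them have already been extracted from the network as arrays of shape (K, H, W). The function name `grad_cam` and the toy inputs are hypothetical, and upsampling the map to the input image size is omitted.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Generic Grad-CAM map from one conv layer.

    activations: (K, H, W) feature maps of the chosen layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps.
    Returns an (H, W) map scaled to [0, 1].
    """
    activations = np.asarray(activations, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    # Channel weights: global-average-pool the gradients over space.
    weights = gradients.mean(axis=(1, 2))             # shape (K,)
    # Weighted sum of feature maps over channels.
    cam = np.tensordot(weights, activations, axes=1)  # shape (H, W)
    # Keep only regions that push the class score up.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example (made up): two channels on a 2x2 grid.
acts = np.zeros((2, 2, 2))
acts[0, 0, 0] = 1.0   # channel 0 fires at the top-left
acts[1, 1, 1] = 1.0   # channel 1 fires at the bottom-right
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])  # channel 0 helps, channel 1 hurts

cam = grad_cam(acts, grads)
print(cam)
```

Only the location driven by the positively weighted channel survives the ReLU, which is why high values in such a map are read as regions supporting the predicted class.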