MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto
TL;DR
MedSAE tackles the interpretability gap in high-performing medical vision-language representations by dissecting the MedCLIP latent space with sparse autoencoders. The approach combines a correlation- and entropy-based assessment of neuron–concept alignment with automated naming via MedGEMMA to ground latent features in clinical terms. On CheXpert, MedSAEs yield higher monosemanticity and identify 21 medically meaningful concepts, outperforming raw MedCLIP features. This work demonstrates a scalable route to transparent, clinically reliable radiology representations by linking high accuracy with mechanistic interpretability.
Abstract
Artificial intelligence in healthcare requires models that are both accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
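
To make the ingredients named in the abstract concrete, the following is a minimal NumPy sketch of a sparse autoencoder over image embeddings, together with one possible correlation score and one possible entropy score for neuron-concept alignment. It is an illustration under stated assumptions, not the authors' implementation: the ReLU encoder, the L1 sparsity penalty, the coefficient `l1_coeff`, and the exact form of both scores are assumed here, and all function names are hypothetical.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode embeddings into non-negative codes, then reconstruct them.

    Assumed architecture: one linear encoder with ReLU, one linear decoder.
    x: (n, d) embeddings; W_enc: (d, m); W_dec: (m, d).
    """
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # sparse codes, shape (n, m)
    x_hat = z @ W_dec + b_dec                         # reconstruction, shape (n, d)
    return z, x_hat


def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on the codes (assumed objective)."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity


def concept_correlations(acts, labels):
    """Pearson correlation between one neuron and each binary concept label.

    acts: (n,) activations of a single neuron; labels: (n, c) binary matrix.
    Assumes neither the neuron nor any label column is constant.
    """
    return np.array(
        [np.corrcoef(acts, labels[:, j])[0, 1] for j in range(labels.shape[1])]
    )


def concept_entropy(acts, labels, eps=1e-12):
    """Normalized entropy of a neuron's activation mass across concepts.

    Values near 0 mean the neuron fires for a single concept (monosemantic);
    values near 1 mean its activation is spread evenly over all concepts.
    Assumes non-negative activations.
    """
    mass = acts @ labels                  # activation mass per concept, shape (c,)
    total = mass.sum()
    if total <= 0:
        return 1.0                        # inactive neuron: treat as uninformative
    p = mass / total
    h = -np.sum(p * np.log(p + eps))
    return float(h / np.log(labels.shape[1]))
```

Under these assumptions, a neuron would count as well aligned with a concept when its correlation with that concept is high and its entropy across all concepts is low. The labels could come from the CheXpert annotations, but how the paper combines the two signals is not specified in this section.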
