MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto

TL;DR

MedSAE tackles the interpretability gap in high-performing medical vision-language representations by dissecting the MedCLIP latent space with sparse autoencoders. The approach combines a correlation- and entropy-based assessment of neuron–concept alignment with automated naming via MedGEMMA to ground latent features in clinical terms. On CheXpert, MedSAEs yield higher monosemanticity and identify 21 medically meaningful concepts, outperforming raw MedCLIP features. This work demonstrates a scalable route to transparent, clinically reliable radiology representations by linking high accuracy with mechanistic interpretability.

Abstract

Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.

Paper Structure

This paper contains 13 sections, 5 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: The overall proposed pipeline. (1) We first train MedSAE on the latent space of the MedCLIP vision encoder and extract the corresponding embeddings. (2) We then compute their Pearson correlation with one-hot encoded label vectors to identify MedSAE neuron-concept mappings. (3) As a final step, for each MedSAE neuron, the top-activating images are used to generate concept names via structured prompting. These names are validated through a detection task, in which MedGEMMA yields a quantitative measure of semantic alignment.
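
Step (2) of the pipeline in Figure 1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `pearson_neuron_concept` and `best_neuron_per_concept` are hypothetical, the data are synthetic stand-ins for MedSAE activations and CheXpert labels, and selecting the single highest-correlation neuron per concept is an assumed mapping rule rather than one stated in the source.

```python
import numpy as np


def pearson_neuron_concept(acts, labels, eps=1e-12):
    """Pearson correlation between every neuron and every concept.

    acts:   (num_images, num_neurons) activations (hypothetical input layout).
    labels: (num_images, num_concepts) one-hot / multi-hot concept labels.
    Returns a (num_neurons, num_concepts) correlation matrix.
    """
    acts = np.asarray(acts, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)

    # Center each neuron and each concept column over the image axis.
    a = acts - acts.mean(axis=0, keepdims=True)
    y = labels - labels.mean(axis=0, keepdims=True)

    # Unnormalized covariance between every neuron/concept pair.
    cov = a.T @ y
    denom = np.outer(np.sqrt((a ** 2).sum(axis=0)),
                     np.sqrt((y ** 2).sum(axis=0)))

    # Zero-variance columns would give 0/0; report 0 instead of NaN.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > eps)


def best_neuron_per_concept(corr):
    """Index of the most positively correlated neuron for each concept
    (an assumed selection rule, used here only for illustration)."""
    return corr.argmax(axis=0)


# Synthetic example: 200 images, 5 neurons, 3 concepts.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(200, 3))
acts = 0.1 * rng.random((200, 5))
acts[:, 2] += labels[:, 0]  # neuron 2 tracks concept 0
acts[:, 4] += labels[:, 1]  # neuron 4 tracks concept 1

corr = pearson_neuron_concept(acts, labels)
best = best_neuron_per_concept(corr)
print(np.round(corr, 2))
print(best)
```

In this toy setting, neurons 2 and 4 are constructed to follow concepts 0 and 1, so they should be the ones selected for those concepts. A neuron that correlates strongly with exactly one concept is the behavior the monosemanticity analysis looks for.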