Med-SegLens: Latent-Level Model Diffing for Interpretable Medical Image Segmentation

Salma J. Ahmed; Emad A. Mohammed; Azam Asilian Bidgoli

TL;DR

Med-SegLens addresses opacity and dataset shift in medical image segmentation by decomposing internal activations into sparse latent features with sparse autoencoders (SAEs) and aligning them across datasets and architectures. The approach reveals a backbone of shared, population-invariant latents and population-specific bottlenecks that causally drive failures, enabling targeted latent-level interventions to recover performance without retraining. It demonstrates substantial gains, recovering roughly 70% of failure cases and elevating edema-related Dice from 39.4% to 74.2%, while enabling cross-dataset adaptation through additive or multiplicative latent steering. The framework provides a practical, mechanistic path for model auditing, failure diagnosis, and equitable, robust deployment in heterogeneous clinical populations, with potential applicability beyond medical imaging.

Abstract

Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce Med-SegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations, while dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaptation without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.
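The latent-level intervention described above can be illustrated with a minimal sketch. It is not the authors' implementation: the ReLU encoder, the linear decoder, the step that adds back the SAE reconstruction error, and every function and parameter name below are assumptions made for illustration. The sketch encodes one activation vector into SAE latents, steers a single latent additively or multiplicatively by a strength alpha, and decodes the result back into activation space.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Assumed ReLU encoder: z = max(0, W_enc x + b_enc)."""
    return [max(0.0, a + b) for a, b in zip(matvec(w_enc, x), b_enc)]


def sae_decode(z: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Assumed linear decoder: x_hat = W_dec z + b_dec."""
    return [a + b for a, b in zip(matvec(w_dec, z), b_dec)]


def steer(z: Vector, index: int, alpha: float, mode: str) -> Vector:
    """Return a copy of z with one latent shifted or scaled by alpha."""
    out = list(z)
    if mode == "additive":
        out[index] += alpha
    elif mode == "multiplicative":
        out[index] *= alpha
    else:
        raise ValueError(f"unknown steering mode: {mode}")
    return out


def steered_activation(x: Vector, w_enc: Matrix, b_enc: Vector,
                       w_dec: Matrix, b_dec: Vector,
                       index: int, alpha: float, mode: str) -> Vector:
    """Encode x, steer one latent, decode, and restore the SAE residual.

    Adding the residual back (an assumption of this sketch) means only
    the steered latent changes the activation passed downstream.
    """
    z = sae_encode(x, w_enc, b_enc)
    x_hat = sae_decode(z, w_dec, b_dec)
    residual = [a - b for a, b in zip(x, x_hat)]
    z_steered = steer(z, index, alpha, mode)
    x_steered = sae_decode(z_steered, w_dec, b_dec)
    return [a + r for a, r in zip(x_steered, residual)]
```

In a real model the steered vector would replace the original activation at the hooked layer, and the segmentation head would be re-run to measure the change in Dice; those steps depend on the model code and are omitted here.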

Paper Structure (46 sections, 13 equations, 14 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Background and Methods
Problem Setting: Representation-Level Model Diffing
Population Knowledge and Datasets
Controlled Model Training
Latent Extraction with Sparse Autoencoders
Automated Latent Semantics Discovery
Cross-Dataset Model Diffing via Latent Alignment
Latent Feature Analysis
SAEs Uncover Interpretable Features
Cross-Architecture Internal Reasoning
Do Models Learn Shared Representations Across Datasets?
Model Diffing via Sparse Autoencoders
Shared and Dataset-Specific Representations
...and 31 more sections

Figures (14)

Figure 2: Examples of SAE features from SegFormer and U-Net. Shown are representative SAE latents corresponding to tumor subregions (edema, enhancing tumor, and necrotic core), capturing diverse spatial and morphological patterns, alongside features associated with healthy anatomy. Each latent is accompanied by its auto-interpreted semantic description. The top row shows the MRI image and ground-truth mask, and the bottom row visualizes the SAE activation heatmap overlaid on the MRI and segmentation mask.
Figure 3: Distribution of active SAE latents across semantic categories for SegFormer and U-Net on 100 BraTS-Adult cases, grouped by automated semantic interpretation.
Figure 4: Feature swapping between Adult and Pediatric SAEs. Shared latents (left) preserve activation structure and spatial correlation, while non-shared latents (right) produce disrupted or absent activations.
Figure 5: Mean Dice versus steering strength ($\alpha$) for adult-specific (left), universally shared (middle), and random (right) SAE latents. Adult-specific steering strongly affects Adult performance and weakly affects Pediatric and SSA, while shared or random steering yields minimal, non-selective changes.
Figure 6: Effect of steering the most activated latent on failure cases across Pediatric (PED), Sub-Saharan African (SSA), and Adult glioma datasets. Amplifying these latents with increasing scaling strength improves the mean Dice score and increases the number of recovered cases.
...and 9 more figures
