Med-SegLens: Latent-Level Model Diffing for Interpretable Medical Image Segmentation
Salma J. Ahmed, Emad A. Mohammed, Azam Asilian Bidgoli
TL;DR
Med-SegLens addresses opacity and dataset shift in medical image segmentation by decomposing internal activations into sparse latent features with sparse autoencoders (SAEs) and aligning those features across datasets and architectures. The approach reveals a backbone of shared, population-invariant latents alongside population-specific latents that act as causal bottlenecks for failures, which enables targeted latent-level interventions that recover performance without retraining. These interventions recover roughly 70% of failure cases and raise edema-related Dice from 39.4% to 74.2%, and additive or multiplicative latent steering supports cross-dataset adaptation. The framework offers a practical, mechanistic path for model auditing, failure diagnosis, and equitable, robust deployment in heterogeneous clinical populations, with potential applicability beyond medical imaging.
Abstract
Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce Med-SegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations and find that dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaptation without retraining, recovering performance in 70% of failure cases and improving the Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.
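To make the idea of a latent-level intervention concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain ReLU sparse autoencoder applied to a batch of activation vectors; the weight names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), the helper names (`sae_encode`, `sae_decode`, `steer`), and the choice to add back the reconstruction residual are all illustrative assumptions. The SAE variant and steering rule actually used in the paper may differ.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Map activations (n, d) to non-negative latent codes (n, k) with a ReLU encoder."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    """Map latent codes (n, k) back to activation space (n, d)."""
    return z @ W_dec + b_dec

def steer(x, W_enc, b_enc, W_dec, b_dec, latent_idx, mode="additive", strength=1.0):
    """Edit one latent and return patched activations (hypothetical helper).

    additive:        z[latent_idx] += strength
    multiplicative:  z[latent_idx] *= strength  (strength=0 ablates the latent)
    """
    z = sae_encode(x, W_enc, b_enc)
    # Assumption: keep whatever the SAE fails to reconstruct, so that a
    # no-op edit leaves the activations unchanged.
    residual = x - sae_decode(z, W_dec, b_dec)

    z_new = z.copy()
    if mode == "additive":
        z_new[:, latent_idx] += strength
    elif mode == "multiplicative":
        z_new[:, latent_idx] *= strength
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return sae_decode(z_new, W_dec, b_dec) + residual

# Toy usage with random (untrained) weights, purely to show the shapes involved.
rng = np.random.default_rng(0)
d_model, n_latents = 8, 16
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
b_dec = np.zeros(d_model)
acts = rng.normal(size=(4, d_model))

patched = steer(acts, W_enc, b_enc, W_dec, b_dec, latent_idx=3,
                mode="additive", strength=2.0)
```

In this sketch, an additive edit shifts every activation vector along the decoder direction of the chosen latent, while a multiplicative edit rescales the latent's existing contribution. The patched activations would then be fed back into the remaining layers of the segmentation model, which is omitted here.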
