ConceptVAE: Self-Supervised Fine-Grained Concept Disentanglement from 2D Echocardiographies
Costin F. Ciusdel, Alex Serban, Tiziano Passerini
TL;DR
ConceptVAE is a self-supervised framework that disentangles input images into a grid of discrete concepts (content) and continuous local styles, enabling fine-grained interpretation and improved downstream performance on 2D echocardiography. It combines a concept discretizer with a stylizer in a VAE-like reconstruction setup and enforces cross-view consistency, priors, and region-coherent concept islands, which yields semantically meaningful region descriptors and robust generalization. Quantitative results show gains over traditional self-supervised learning (SSL) baselines on region-based retrieval, semantic segmentation, near-OOD detection, and AV localization, while style-based generation offers calibrated texture variation without altering anatomy. The work suggests substantial potential for more interpretable, region-aware pre-training in medical imaging and points to extensions to additional modalities and 3D data with automated selection of the number of concepts.
Abstract
While traditional self-supervised learning methods improve performance and robustness across various medical tasks, they rely on single-vector embeddings that may not capture fine-grained concepts such as anatomical structures or organs. The ability to identify such concepts and their characteristics without supervision has the potential to improve pre-training methods and to enable novel applications such as fine-grained image retrieval and concept-based outlier detection. In this paper, we introduce ConceptVAE, a novel pre-training framework that detects fine-grained concepts and disentangles them from their style characteristics in a self-supervised manner. We present a suite of loss terms and model architecture primitives designed to discretise input data into a preset number of concepts along with their local style. We validate ConceptVAE both qualitatively and quantitatively, demonstrating its ability to detect fine-grained anatomical structures such as blood pools and septum walls in 2D echocardiograms. Quantitatively, ConceptVAE outperforms traditional self-supervised methods in tasks such as region-based instance retrieval, semantic segmentation, out-of-distribution detection, and object detection. Additionally, we explore the generation of in-distribution synthetic data that maintains the same concepts as the training data but with distinct styles, highlighting the framework's potential for more calibrated data generation. Overall, our study introduces and validates a promising new pre-training technique based on concept-style disentanglement, opening multiple avenues for developing models for medical image analysis that are more interpretable and explainable than black-box approaches.
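To make the idea of discretising an input into a grid of concepts with local styles more concrete, the toy sketch below assigns each cell of a feature grid to its nearest concept prototype and keeps the residual as that cell's style. This is only an illustration under stated assumptions, not the ConceptVAE architecture or its loss terms: nearest-prototype assignment (in the spirit of vector quantization) is swapped in for the paper's learned discretizer, the residual-as-style choice is an assumption, and the names `discretize`, `codebook`, and `features` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def discretize(features, codebook):
    """Toy concept/style split (illustrative assumption, not the paper's method).

    features: H x W grid (nested lists) of D-dimensional feature vectors.
    codebook: K concept prototypes, each a D-dimensional vector.
    Returns (concept_grid, style_grid), where concept_grid[i][j] is the index
    of the nearest prototype and style_grid[i][j] is the residual vector.
    """
    concept_grid, style_grid = [], []
    for row in features:
        concept_row, style_row = [], []
        for vec in row:
            # Discrete content: index of the closest concept prototype.
            k = min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))
            concept_row.append(k)
            # Continuous local style (assumed here): offset from the prototype.
            style_row.append([v - p for v, p in zip(vec, codebook[k])])
        concept_grid.append(concept_row)
        style_grid.append(style_row)
    return concept_grid, style_grid


# Hypothetical 2 x 2 feature grid with K = 2 concepts in D = 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [
    [[0.1, -0.1], [0.9, 1.2]],
    [[0.2, 0.0], [1.0, 1.0]],
]
concept_grid, style_grid = discretize(features, codebook)
print(concept_grid)  # [[0, 1], [0, 1]]
```

In this toy setting, two cells that share a concept index but differ in their style vectors correspond to the same kind of region with a different local appearance, which is the intuition behind region-based retrieval and style-only variation described in the abstract.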
