ConceptVAE: Self-Supervised Fine-Grained Concept Disentanglement from 2D Echocardiographies

Costin F. Ciusdel, Alex Serban, Tiziano Passerini

TL;DR

ConceptVAE is a self-supervised framework that disentangles each input image into a grid of discrete concepts (content) and continuous local styles, enabling fine-grained interpretation and improved downstream performance on 2D echocardiography. It combines a concept discretizer and a stylizer in a VAE-like reconstruction setup and enforces cross-view consistency, priors, and region-coherent concept islands, which yields semantically meaningful region descriptors and robust generalization. Quantitatively, it improves on traditional SSL baselines in region-based retrieval, semantic segmentation, near-OOD detection, and AV localization, while style-based generation varies texture in a calibrated way without altering anatomy. The work points toward more interpretable, region-aware pre-training in medical imaging, with possible extensions to further modalities, 3D data, and automated selection of the concept count.

Abstract

While traditional self-supervised learning methods improve performance and robustness across various medical tasks, they rely on single-vector embeddings that may not capture fine-grained concepts such as anatomical structures or organs. The ability to identify such concepts and their characteristics without supervision has the potential to improve pre-training methods and enable novel applications such as fine-grained image retrieval and concept-based outlier detection. In this paper, we introduce ConceptVAE, a novel pre-training framework that detects fine-grained concepts and disentangles them from their style characteristics in a self-supervised manner. We present a suite of loss terms and model architecture primitives designed to discretise input data into a preset number of concepts along with their local style. We validate ConceptVAE both qualitatively and quantitatively, demonstrating its ability to detect fine-grained anatomical structures such as blood pools and septum walls in 2D echocardiograms. Quantitatively, ConceptVAE outperforms traditional self-supervised methods in tasks such as region-based instance retrieval, semantic segmentation, out-of-distribution detection, and object detection. Additionally, we explore the generation of in-distribution synthetic data that maintains the same concepts as the training data but with distinct styles, highlighting its potential for more calibrated data generation. Overall, our study introduces and validates a promising new pre-training technique based on concept-style disentanglement, opening multiple avenues for developing models for medical image analysis that are more interpretable and explainable than black-box approaches.

Paper Structure

This paper contains 14 sections, 7 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: ConceptVAE overview. The blue blocks are trainable, while the grey blocks are updated only via an exponential moving average.
  • Figure 2: ConceptVAE model architecture and training setup, where the EMA blocks represent the exponential moving average mirrors of regular blocks. Loss components are shown in colored ellipses, and s.g. denotes stop-gradient. Solid arrows indicate tensor flows within the model, while dashed arrows represent tensors involved in loss functions.
  • Figure 3: Concept maps for three randomly sampled inputs. The 16$\times$-stride concept grid is up-sampled to the original image size. The indices of the most likely concept for each grid location are displayed in red at the bottom-left of each location. The grid is color-coded according to concept indices for better visualisation.
  • Figure 4: Effect of concept swapping. The left image is the reconstruction based only on the greedy concept map (with $x_{style}:=0$). The middle reconstruction illustrates the effect of swapping two modifier concepts, while the right reconstruction illustrates the large changes induced by swapping two anatomy-specific concepts.
  • Figure 5: Region-based instance retrieval using conceptual search. The leftmost column displays query images, while the last three columns show the top-3 kNN retrieval results. Red dots indicate the centers of the query and matched descriptor regions. Below each image, the view and cardiac phase are displayed. Matches marked with an asterisk (*) are from the same acquisition as the query image, but from a different cardiac phase.
  • ...and 3 more figures
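
The concept maps described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the model emits per-location concept scores of shape (grid height, grid width, number of concepts); the function name `greedy_concept_map`, the 14×14 grid, and the 8 concepts are hypothetical choices for illustration and are not taken from the paper. The sketch picks the most likely concept at each grid location and up-samples the 16×-stride grid to image size by nearest-neighbour repetition, which may differ from the interpolation used for the published figure.

```python
import numpy as np


def greedy_concept_map(concept_logits: np.ndarray, stride: int = 16):
    """Return (grid_indices, upsampled_map) from per-location concept scores.

    concept_logits: array of shape (H_grid, W_grid, K), an assumed layout.
    stride: up-sampling factor; the Figure 3 caption mentions a 16x stride.
    """
    if concept_logits.ndim != 3:
        raise ValueError("expected an array of shape (H_grid, W_grid, K)")
    # Greedy choice: index of the most likely concept at each grid location.
    grid = concept_logits.argmax(axis=-1)
    # Nearest-neighbour up-sampling: repeat every cell `stride` times per axis.
    upsampled = np.repeat(np.repeat(grid, stride, axis=0), stride, axis=1)
    return grid, upsampled


# Hypothetical example: a 14x14 grid with 8 concepts maps to a 224x224 image.
rng = np.random.default_rng(0)
logits = rng.normal(size=(14, 14, 8))
grid, full = greedy_concept_map(logits)
print(grid.shape, full.shape)  # (14, 14) (224, 224)
```

The up-sampled integer map can then be colour-coded by concept index, for example with a categorical colormap, to obtain an overlay similar in spirit to the one the caption describes.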