Label-Efficient 3D Brain Segmentation via Complementary 2D Diffusion Models with Orthogonal Views
Jihoon Cho, Suhyun Ahn, Beomju Kim, Hyungjoon Bae, Xiaofeng Liu, Fangxu Xing, Kyungeun Lee, Georges Elfakhri, Van Wedeen, Jonghye Woo, Jinah Park
TL;DR
This work tackles label-efficient 3D brain segmentation by using complementary 2D diffusion models, applied to orthogonal views, to extract semantic features that are fused into voxelwise 3D representations and used to train a simple MLP segmenter. With a large unlabeled dataset for pretraining and a sparse labeling scheme (as few as nine labeled slices plus a labeled background region), the approach achieves competitive and often superior segmentation performance for subcortical structures compared with state-of-the-art self-supervised methods, even when the labeled data per subject is minimal. It avoids heavy 3D diffusion modeling and instead integrates 3D context by fusing features from the axial, coronal, and sagittal views, which makes it attractive for clinical settings where annotation is scarce.
Abstract
Deep learning-based segmentation techniques have shown remarkable performance in brain segmentation, yet their success hinges on the availability of extensive labeled training data. Acquiring such large labeled datasets, however, poses a significant challenge in many clinical applications. To address this issue, we propose a novel 3D brain segmentation approach using complementary 2D diffusion models. The core idea is to first mine 2D features carrying semantic information from the 2D diffusion models, which take orthogonal views as input, and then fuse them into a 3D contextual feature representation. We then use these aggregated features to train multi-layer perceptrons to classify the segmentation labels. Our goal is to achieve reliable segmentation quality without requiring complete labels for each individual subject. In experiments on brain subcortical structure segmentation, training on data from only one subject, our approach outperforms state-of-the-art self-supervised learning methods. Further experiments on the minimum annotation requirement under sparse labeling yield promising results even with only nine slices and a labeled background region.
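The data flow described above (per-view 2D features, fusion into voxelwise features, sparse supervision) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `slice_features` is a hypothetical stand-in for features read out of a 2D diffusion model, fusion by channel concatenation is an assumption, and the choice of nine evenly spaced slices along one axis is purely illustrative.

```python
import numpy as np

def slice_features(slice_2d):
    """Hypothetical stand-in for per-pixel features from a 2D diffusion model.
    Returns an array of shape (C, h, w); here C = 2 (intensity, gradient magnitude)."""
    img = slice_2d.astype(np.float32)
    gy, gx = np.gradient(img)
    return np.stack([img, np.hypot(gx, gy)], axis=0)

def view_features(volume, axis):
    """Apply the 2D extractor to every slice along `axis` and
    reassemble the result into a (C, D, H, W) feature volume."""
    slices = np.moveaxis(volume, axis, 0)                 # (n_slices, a, b)
    feats = np.stack([slice_features(s) for s in slices], axis=1)
    return np.moveaxis(feats, 1, axis + 1)                # restore voxel layout

def fuse_views(volume):
    """Concatenate features from the three orthogonal views (assumed fusion rule)
    and flatten to one feature vector per voxel: shape (D*H*W, 3C)."""
    per_view = [view_features(volume, axis) for axis in range(3)]
    fused = np.concatenate(per_view, axis=0)              # (3C, D, H, W)
    return fused.reshape(fused.shape[0], -1).T

def sparse_training_set(features, labels, labeled_slices, axis=0):
    """Keep only voxels that lie on the annotated slices."""
    mask = np.zeros(labels.shape, dtype=bool)
    index = [slice(None)] * 3
    index[axis] = labeled_slices
    mask[tuple(index)] = True
    flat = mask.ravel()
    return features[flat], labels.ravel()[flat]

# Toy example: random volume, synthetic labels, nine annotated slices.
volume = np.random.default_rng(0).random((16, 12, 10))
labels = (volume > 0.5).astype(np.int64)
X = fuse_views(volume)
nine_slices = np.linspace(0, 15, 9).astype(int).tolist()
X_train, y_train = sparse_training_set(X, labels, nine_slices)
print(X.shape, X_train.shape, y_train.shape)
```

The pairs `(X_train, y_train)` are what a voxelwise classifier such as an MLP would be trained on; at inference time the classifier would be applied to every row of `X` and the predictions reshaped back to the volume.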
