VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging

Yufan He, Pengfei Guo, Yucheng Tang, Andriy Myronenko, Vishwesh Nath, Ziyue Xu, Dong Yang, Can Zhao, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu, Wenqi Li

TL;DR

VISTA3D is the first model to achieve state-of-the-art performance in both 3D automatic and 3D interactive segmentation, even when compared with top 3D expert models on large and diverse benchmarks. The model, recipe, and insights represent a promising step towards a clinically useful 3D foundation model.

Abstract

Foundation models for interactive segmentation in 2D natural images and videos have sparked significant interest in building 3D foundation models for medical imaging. However, the domain gaps and clinical use cases of 3D medical imaging require a dedicated model that diverges from existing 2D solutions. Specifically, such foundation models should support a full workflow that actually reduces human effort. Treating 3D medical images as sequences of 2D slices and reusing interactive 2D foundation models seems straightforward, but slice-by-slice 2D annotation is too time-consuming for 3D tasks. Moreover, for large cohort analysis, highly accurate automatic segmentation models reduce human effort the most. However, these automatic models support neither interactive corrections nor zero-shot segmentation of novel structures, which is a key feature of a foundation model. While reusing pre-trained 2D backbones in 3D enhances zero-shot potential, the performance of such models on complex 3D structures still lags behind leading 3D models. To address these issues, we present VISTA3D (Versatile Imaging SegmenTation and Annotation), a model that aims to meet all of these challenges and requirements with one unified foundation model. VISTA3D is built on top of the well-established 3D segmentation pipeline, and it is the first model to achieve state-of-the-art performance in both 3D automatic (supporting 127 classes) and 3D interactive segmentation, even when compared with top 3D expert models on large and diverse benchmarks. Additionally, VISTA3D's 3D interactive design allows efficient human correction, and a novel 3D supervoxel method that distills 2D pretrained backbones grants VISTA3D top 3D zero-shot performance. We believe the model, recipe, and insights represent a promising step towards a clinically useful 3D foundation model. Code and weights are publicly available at https://github.com/Project-MONAI/VISTA.

Paper Structure

This paper contains 11 sections, 7 figures, 3 tables, 2 algorithms.

Figures (7)

  • Figure 1: (a) The full human-in-the-loop workflow that VISTA3D supports. If the segmentation task $X$ is within the 127 supported classes (left green circle), VISTA3D performs accurate automatic segmentation. The doctor can inspect the result and, if needed, edit it efficiently with VISTA3D. If $X$ is a novel class (right blue circle), VISTA3D performs 3D interactive zero-shot segmentation. (b) The VISTA3D architecture. It contains two branches that share the same image encoder. The top auto-branch activates out-of-the-box automatic segmentation if the user provides a class prompt that is within the 127 supported classes. The bottom interactive branch activates interactive segmentation if the user provides 3D point click prompts. If both branches are activated, a merger module based on Alg. \ref{alg:ir} uses the interactive results to edit the automatic results.
  • Figure 2: Supervoxels generated by Alg. \ref{alg:supervoxel}, shown in axial, sagittal, and coronal views. Different colours represent different supervoxels.
  • Figure 3: Correcting automatic segmentation with points. The first panel shows the automatic liver segmentation with a false negative area. After a positive point is added, the false negative region is corrected. The third panel shows another slice with a false positive region; a negative point removes that region, as shown in the last panel.
  • Figure 4: An example of a monkey CT scan (two sagittal slices). VISTA3D achieves more robust segmentation.
  • Figure 5: Zero-shot Dice scores. The X-axis is the number of click points. The Y-axis is the average Dice score over the whole dataset.
  • ...and 2 more figures
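
The branch routing described in the Figure 1 caption can be illustrated with a small sketch. This is not the VISTA3D code: the `Prompt` class, the `route` function, the mode names, and the zero-based class indexing are all hypothetical, chosen only to mirror the caption (class prompt within the 127 supported classes activates the auto-branch, point clicks activate the interactive branch, and both together go through the merger).

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Number of classes supported by automatic segmentation (from the caption).
NUM_SUPPORTED_CLASSES = 127

Point3D = Tuple[int, int, int]


@dataclass
class Prompt:
    """Hypothetical user prompt: an optional class index and optional clicks."""
    class_id: Optional[int] = None   # assumed zero-based index, illustration only
    points: Sequence[Point3D] = ()   # 3D click coordinates


def route(prompt: Prompt) -> str:
    """Return which path handles the prompt, following the Figure 1 caption."""
    # Auto-branch: only for a class prompt inside the supported set.
    use_auto = (
        prompt.class_id is not None
        and 0 <= prompt.class_id < NUM_SUPPORTED_CLASSES
    )
    # Interactive branch: any 3D point click activates it.
    use_interactive = len(prompt.points) > 0

    if use_auto and use_interactive:
        return "merged"        # interactive result edits the automatic result
    if use_auto:
        return "automatic"
    if use_interactive:
        return "interactive"   # also the zero-shot path for novel classes
    raise ValueError("need a supported class prompt or at least one point click")
```

For example, `route(Prompt(class_id=5))` selects the automatic path, while a prompt with only point clicks, or with an unsupported class plus clicks, falls through to the interactive (zero-shot) path.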