Leveraging Self-Supervised Vision Transformers for Segmentation-based Transfer Function Design

Dominik Engel, Leon Sick, Timo Ropinski

TL;DR

The paper tackles the challenge of semantically meaningful transfer function design for volume rendering while preserving interactivity. It proposes an annotation-driven pipeline that leverages frozen self-supervised Vision Transformer features to locate structures of interest without training new models, merging 2D feature maps into a 3D volume to support rapid similarity-based selection. An optional 3D Fast Bilateral Solver refines similarity maps for high-quality rendering, enabling transfer function design within seconds to minutes. Quantitative and qualitative results show competitive segmentation performance with orders of magnitude fewer annotations than traditional learning-based methods and strong usability in a user study, highlighting practical impact for exploratory visualization across CT, MRI, and other modalities.

Abstract

In volume rendering, transfer functions are used to classify structures of interest, and to assign optical properties such as color and opacity. They are commonly defined as 1D or 2D functions that map simple features to these optical properties. As the process of designing a transfer function is typically tedious and unintuitive, several approaches have been proposed for their interactive specification. In this paper, we present a novel method to define transfer functions for volume rendering by leveraging the feature extraction capabilities of self-supervised pre-trained vision transformers. To design a transfer function, users simply select the structures of interest in a slice viewer, and our method automatically selects similar structures based on the high-level features extracted by the neural network. Contrary to previous learning-based transfer function approaches, our method does not require training of models and allows for quick inference, enabling an interactive exploration of the volume data. Our approach reduces the amount of necessary annotations by interactively informing the user about the current classification, so they can focus on annotating the structures of interest that still require annotation. In practice, this allows users to design transfer functions within seconds, instead of minutes. We compare our method to existing learning-based approaches in terms of annotation and compute time, as well as with respect to segmentation accuracy. Our accompanying video showcases the interactivity and effectiveness of our method.
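The similarity-based selection described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the feature volume is a dense NumPy array of shape (D, H, W, C), that similarity is measured by cosine similarity, and that multiple annotations are combined by taking the per-voxel maximum. The function name `similarity_map` is hypothetical.

```python
import numpy as np

def similarity_map(features, annotations):
    """Cosine similarity of every voxel to its best-matching annotated voxel.

    features:    (D, H, W, C) array of per-voxel feature vectors (assumed layout)
    annotations: list of (z, y, x) voxel indices selected by the user
    returns:     (D, H, W) array with values in [-1, 1]
    """
    # Normalize each feature vector to unit length so a dot product
    # equals the cosine similarity.
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)

    # Gather the feature vectors at the annotated positions: shape (N, C).
    idx = np.asarray(annotations)
    picked = unit[idx[:, 0], idx[:, 1], idx[:, 2]]

    # Similarity of every voxel to every annotation: shape (D, H, W, N).
    sims = unit @ picked.T

    # Assumption: a voxel belongs to the class if it resembles any annotation,
    # so keep the best match per voxel.
    return sims.max(axis=-1)
```

Because nothing is trained, adding an annotation only appends one row to `picked` and repeats a single matrix product, which is consistent with the interactive feedback loop the abstract describes.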

Paper Structure (25 sections, 4 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Method Overview. In the Feature Extraction Pre-Processing step, the volume data $\mathcal{V}$ is sliced along each axis, and the slices are fed separately through the pre-trained DINO network. The resulting features are merged into a feature volume $\mathcal{F}$. Then, the user starts with Annotation in a slice viewer. Whenever the user annotates new voxels, we immediately Compute Similarity (blue highlights) of the annotated samples (orange circles) with the feature volume $\mathcal{F}$ (see Fig. 2 for a step-by-step visualization). With this immediate feedback, the user can focus on the few regions that are still missing after the initial annotations. Once the user is satisfied with $\mathcal{S}_{\text{L}}$, they can enable the bilateral solver (BLS) as a Post-Process to obtain $\mathcal{S}_{\text{H}}$ with increased resolution. The whole process typically takes less than one minute in practice and is repeated for each class. Please watch the video at https://youtu.be/kTPBCYJtEJc for a demonstration.
  • Figure 2: Annotation Interface. The user is presented with a slice viewer and a 3D rendering. Annotations can be either brushed using the mouse or set using individual points. After an annotation is set, the similarity map $\mathcal{S}_{\text{L}}$ is computed and displayed (blue) together with the annotation positions (orange circles). The 3D view displays an iso-surface rendering of $\mathcal{S}_{\text{L}}$. The similarity map informs the user where further annotations are required to fully segment the desired region. After just 3 annotations, the lung is mostly detected, and we can refine this result using the bilateral solver to obtain $\mathcal{S}_{\text{H}}$.
  • Figure 3: Qualitative Results. We apply our method to various volume datasets, namely Bonsai, Tooth and the MRI Heart. Each of the classes required between 3 and 9 annotations.
  • Figure 4: Qualitative Results on MRI (VisContest2010)
  • Figure 5: Qualitative Results on CT (Carp)
  • ...and 7 more figures
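
The pre-processing step from the Figure 1 caption (slice the volume along each axis, extract 2D features per slice, merge into one feature volume) could look roughly like the sketch below. Several parts are assumptions made for illustration only: `toy_slice_features` is a hypothetical stand-in for the frozen DINO network and simply returns intensity and gradients per pixel, and the three per-axis feature stacks are merged by averaging, which the caption does not specify. A real ViT would also produce patch-level features at a lower resolution than the input slice, which this toy version ignores.

```python
import numpy as np

def toy_slice_features(slice_2d):
    """Hypothetical stand-in for a frozen 2D feature extractor.

    Returns an (A, B, 3) array: intensity plus the two image gradients.
    """
    g0, g1 = np.gradient(slice_2d)
    return np.stack([slice_2d, g0, g1], axis=-1)

def build_feature_volume(volume, extract=toy_slice_features):
    """Slice a (D, H, W) volume along each axis and merge 2D features.

    Returns a (D, H, W, C) feature volume. Merging by averaging the three
    per-axis stacks is an assumption, not taken from the paper.
    """
    stacks = []
    for axis in range(3):
        # Bring the slicing axis to the front: (n_slices, A, B).
        slices = np.moveaxis(volume, axis, 0)
        # Run the 2D extractor on every slice: (n_slices, A, B, C).
        feats = np.stack([extract(s) for s in slices], axis=0)
        # Move the slicing axis back so all stacks share the (D, H, W, C) layout.
        stacks.append(np.moveaxis(feats, 0, axis))
    return np.mean(stacks, axis=0)
```

The extractor is passed in as a parameter so that a real pre-trained network could replace the toy function without changing the slicing and merging logic.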