NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization

Yik San Cheng, Runkai Zhao, Weidong Cai

Abstract

2D visual foundation models, such as DINOv3, a self-supervised model trained on large-scale natural images, have demonstrated strong zero-shot generalization, capturing both rich global context and fine-grained structural cues. However, an analogous 3D foundation model for downstream volumetric neuroimaging remains lacking, largely due to the challenges of 3D image acquisition and the scarcity of high-quality annotations. To address this gap, we propose to adapt the 2D visual representations learned by DINOv3 to a 3D biomedical segmentation model, enabling more data-efficient and morphologically faithful neuronal reconstruction. Specifically, we design an inflation-based adaptation strategy that inflates 2D filters into 3D operators, preserving semantic priors from DINOv3 while adapting to 3D neuronal volume patches. In addition, we introduce a topology-aware skeleton loss to explicitly enforce structural fidelity of graph-based neuronal arbor reconstruction. Extensive experiments on four neuronal imaging datasets, including two from BigNeuron and two public datasets, NeuroFly and CWMBS, demonstrate consistent improvements in reconstruction accuracy over SoTA methods, with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure. Code: https://github.com/yy0007/NeurINO.
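
The inflation step described in the abstract can be illustrated with a minimal sketch. It assumes the common scheme of replicating each 2D kernel along a new depth axis and dividing by the depth, so that an input volume that is constant along depth produces the same response as the original 2D filter. The abstract does not state which inflation scheme NeurINO actually uses, and the function name `inflate_conv_kernel` and the kernel shapes below are hypothetical, chosen only for illustration.

```python
import numpy as np


def inflate_conv_kernel(w2d: np.ndarray, depth: int) -> np.ndarray:
    """Inflate 2D conv weights (out_ch, in_ch, kH, kW) to 3D weights
    (out_ch, in_ch, depth, kH, kW).

    Assumed scheme (not confirmed by the paper): copy the 2D kernel
    `depth` times along a new axis and rescale by 1/depth, so the
    slices sum back to the original 2D kernel.
    """
    if w2d.ndim != 4:
        raise ValueError("expected weights of shape (out_ch, in_ch, kH, kW)")
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    # Insert a depth axis after the channel axes, then replicate along it.
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], depth, axis=2)
    # Rescale so the total filter mass is unchanged.
    return w3d / depth


# Hypothetical stand-in for pretrained 2D weights (random, for illustration).
rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 3, 7, 7))
w3d = inflate_conv_kernel(w2d, depth=7)

print(w3d.shape)  # (8, 3, 7, 7, 7)
# Summing the 3D kernel over depth recovers the 2D kernel.
print(np.allclose(w3d.sum(axis=2), w2d))
```

Under this assumed scheme the pretrained 2D response is preserved at initialization, and the depth slices are free to diverge during fine-tuning on 3D volume patches.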

Paper Structure (28 sections, 18 equations, 12 figures, 11 tables)

Figures (12)

  • Figure 1: a. We propose a data-efficient 3D neuron segmentation model, NeurINO, that bridges a 2D foundation model (DINOv3) and 3D volumetric neuroimaging; b. F1-score comparison across four neuronal datasets, where bubble size indicates the number of model parameters. NeurINO achieves consistent segmentation improvements with comparable model complexity; c. NeurINO outperforms the second-best methods (MedNeXt and nnUNet) by margins of 1–9%, achieving the lowest HD95 across all datasets.
  • Figure 2: Overview of our framework. We adapt a pretrained DINOv3 ConvNeXt encoder for volumetric neuron segmentation through an inflation-based 3D adaptation strategy, which spatially expands 2D convolutional kernels into 3D while preserving pretrained semantics. The inflated encoder is coupled with a symmetric MedNeXt-style decoder to recover fine-grained neuronal morphology. Multi-level outputs are supervised jointly with the proposed Topology-Aware Skeleton Loss (TASL), which measures structural discrepancies between predicted and ground-truth skeleton graphs. $S_{2D}$ and $S_{3D}$ denote the spatial sizes in 2D and 3D feature maps, respectively.
  • Figure 3: Visual comparison of the segmentation and tracing results on the Drosophila dataset. The top row presents raw images and segmentation outputs, followed by two rows showing the corresponding reconstruction results generated by SmartTracing and NeuTube. Magenta boxes indicate severe false negatives (missed neurites) in other methods. Best viewed zoomed in.
  • Figure 4: Visual comparison of the segmentation and tracing results on the Mouse dataset. The top row presents raw images and segmentation outputs, followed by two rows showing the corresponding reconstruction results generated by SmartTracing and NeuTube. Magenta arrows highlight severe false positives in other methods. Best viewed zoomed in.
  • Figure 5: Visual comparison of the segmentation and tracing results on the NeuroFly dataset. The top row presents raw images and segmentation outputs, followed by two rows showing the corresponding reconstruction results generated by SmartTracing and NeuTube. Magenta boxes indicate severe false negatives (missed neurites) in other methods. Best viewed zoomed in.
  • ...and 7 more figures