Introducing 3D Representation for Medical Image Volume-to-Volume Translation via Score Fusion

Xiyue Zhu, Dou Hoon Kwark, Ruike Zhu, Kaiwen Hong, Yiqi Tao, Shirui Luo, Yudu Li, Zhi-Pei Liang, Volodymyr Kindratenko

TL;DR

Score-Fusion addresses the challenge of learning 3D volumetric distributions for medical volume-to-volume translation by fusing perpendicular 2D diffusion models in score-function space via a 3D fusion network. It initializes the 3D fusion as a weighted average of 2D scores (TPDM style) and then fine-tunes, with 2D feature maps injected into the 3D layers to form an ensemble of 2D representations. It demonstrates superior accuracy and 3D realism on BraTS and HCP tasks for super-resolution and modality translation, and it improves downstream tumor segmentation performance while enabling efficient multi-modality fusion. The work provides new insight into diffusion model ensembling by learning in score-function space and offers a plug-in approach that balances performance with training efficiency.

Abstract

In medical image volume-to-volume translation, existing models often struggle to capture the inherent volumetric distribution using 3D voxel-space representations, due to high computational and data demands. We present Score-Fusion, a novel volumetric translation model that effectively learns 3D representations by ensembling perpendicularly trained 2D diffusion models in score function space. By carefully initializing our model to start with an average of 2D models as in TPDM, we reduce 3D training to a fine-tuning process and thereby mitigate both computational and data demands. Furthermore, we explicitly design the 3D model's hierarchical layers to learn ensembles of 2D features, further enhancing efficiency and performance. Moreover, Score-Fusion naturally extends to multi-modality settings by fusing diffusion models conditioned on different inputs for flexible, accurate integration. We demonstrate that 3D representation is essential for better performance in downstream recognition tasks, such as tumor segmentation, where most segmentation models are based on 3D representation. Extensive experiments demonstrate that Score-Fusion achieves superior accuracy and volumetric fidelity in 3D medical image super-resolution and modality translation. Beyond these improvements, our work also provides broader insight into learning-based approaches for score function fusion.
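The initialization idea in the abstract (start from an average of the 2D scores, then fine-tune a 3D model on top) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `slicewise_score`, `ToyFusion`, the linear residual, and the stand-in 2D score functions are hypothetical names and simplifications, not the authors' architecture, which uses a learned 3D network with injected 2D feature maps.

```python
import numpy as np


def slicewise_score(volume, score_2d, axis):
    """Apply a 2D score function to every slice taken along `axis`."""
    moved = np.moveaxis(volume, axis, 0)            # slices now indexed on axis 0
    scored = np.stack([score_2d(s) for s in moved], axis=0)
    return np.moveaxis(scored, 0, axis)             # restore the original layout


class ToyFusion:
    """Hypothetical stand-in for a 3D fusion model (not the paper's network)."""

    def __init__(self, weight=0.5):
        self.weight = weight
        # Zero-initialised residual weights: at initialisation the output
        # is exactly the weighted average of the two 2D scores.
        self.residual_gain = np.zeros(3)

    def __call__(self, x, score_a, score_b):
        base = self.weight * score_a + (1.0 - self.weight) * score_b
        features = np.stack([x, score_a, score_b], axis=-1)   # (D, H, W, 3)
        residual = features @ self.residual_gain               # (D, H, W)
        return base + residual


# Toy volume and two stand-in "2D models" scoring perpendicular planes.
x = np.random.default_rng(0).standard_normal((4, 5, 6))
score_a = slicewise_score(x, lambda s: -s, axis=0)
score_b = slicewise_score(x, lambda s: -2.0 * s, axis=1)

fusion = ToyFusion()
fused = fusion(x, score_a, score_b)

# Before any fine-tuning, the fused score equals the plain average.
print(np.allclose(fused, 0.5 * score_a + 0.5 * score_b))
```

Under these assumptions, training only has to move `residual_gain` away from zero, so optimization starts from the averaged 2D ensemble rather than from scratch; the zero-initialized residual is one common way to obtain that property and is used here only as an example.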

Paper Structure (23 sections, 7 equations, 21 figures, 12 tables, 2 algorithms)

This paper contains 23 sections, 7 equations, 21 figures, 12 tables, 2 algorithms.

Figures (21)

  • Figure 1: Comparison between TPDM (left) and Score-Fusion (right). Score-Fusion learns to ensemble pre-trained diffusion models with a 3D model, effectively utilizing 3D representations. Our model thus shows better 3D realism and demonstrates superior accuracy and realism metrics.
  • Figure 2: Overview of Score-Fusion. At each denoising step, two pre-trained 2D models provide initial estimates of the scores in a slice-wise manner. Subsequently, a 3D network learns to integrate these estimates via a 3D representation extracted from the 3D input and the aggregated 2D scores. In addition, the 3D model is initialized to output an average of the 2D scores. Moreover, the 3D network layers are reformulated to learn an ensemble of aggregated and aligned 2D features. These designs accelerate and stabilize the 3D training process.
  • Figure 3: Visual comparison of generated samples for three different conditions. The first three rows show axial view slices from different MRI volumes. Neither Score-Fusion nor TPDM has a 2D model trained in this direction. The last three rows show slices for the same MRI volume in all three views. Score-Fusion reconstructs more realistic details with smoother edges and fewer artifacts.
  • Figure 4: Qualitative results on the HCP dataset. The input is a 4×4×4 downsampled version of the ground truth.
  • Figure 5: Comparison of Dice scores and recovery rates for super-resolution. The value on the left represents the Dice score, while the value on the right represents the recovery rate.
  • ...and 16 more figures
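
For readers unfamiliar with the Dice score reported in Figure 5, the standard definition for binary masks is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that textbook formula; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions made here, and the paper's exact evaluation protocol (including how the recovery rate is computed) is not specified in this excerpt.

```python
import numpy as np


def dice_score(pred, target):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denom


# Two 32-voxel slabs in a 4x4x4 volume that overlap in 16 voxels.
pred = np.zeros((4, 4, 4), dtype=bool)
target = np.zeros((4, 4, 4), dtype=bool)
pred[:2] = True
target[1:3] = True

print(dice_score(pred, target))  # 2 * 16 / (32 + 32) = 0.5
```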