RL-U$^2$Net: A Dual-Branch UNet with Reinforcement Learning-Assisted Multimodal Feature Fusion for Accurate 3D Whole-Heart Segmentation

Jierui Qu, Jianchun Zhao

TL;DR

This paper tackles the challenge of accurate 3D whole-heart segmentation from multimodal CT and MRI by introducing RL-U$^2$Net, a dual-branch U-Net that jointly learns feature alignment and segmentation. A novel RL-XAlign module combines cross-modal attention (CMA) with a reinforcement learning policy (PoseAlign) to dynamically align CT and MRI features, while a fusion module and ensemble decision scheme integrate patch-level predictions. The training framework includes multi-task losses and an Adaptive Gradient Weight Distributor to balance learning across modalities, ensuring stable and cooperative optimization. On MM-WHS 2017, RL-U$^2$Net achieves state-of-the-art Dice scores for CT (0.931) and MRI (0.870) with superior boundary accuracy, validating the effectiveness of reinforcement-learning–assisted multimodal fusion in medical image analysis.

Abstract

Accurate whole-heart segmentation is a critical component in the precise diagnosis and interventional planning of cardiovascular diseases. Integrating complementary information from modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) can significantly enhance segmentation accuracy and robustness. However, existing multi-modal segmentation methods face several limitations: severe spatial inconsistency between modalities hinders effective feature fusion; fusion strategies are often static and lack adaptability; and the processes of feature alignment and segmentation are decoupled and inefficient. To address these challenges, we propose a dual-branch U-Net architecture enhanced by reinforcement learning for feature alignment, termed RL-U$^2$Net, designed for precise and efficient multi-modal 3D whole-heart segmentation. The model employs a dual-branch U-shaped network to process CT and MRI patches in parallel, and introduces a novel RL-XAlign module between the encoders. The module uses a cross-modal attention mechanism to capture semantic correspondences between modalities, while a reinforcement-learning agent learns an optimal rotation strategy that consistently aligns anatomical pose and texture features. The aligned features are then reconstructed through their respective decoders. Finally, an ensemble-learning-based decision module integrates the predictions from individual patches to produce the final segmentation result. Experimental results on the publicly available MM-WHS 2017 dataset demonstrate that the proposed RL-U$^2$Net outperforms existing state-of-the-art methods, achieving Dice coefficients of 93.1% on CT and 87.0% on MRI, thereby validating the effectiveness and superiority of the proposed approach.
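
For readers unfamiliar with the metric reported above, the Dice coefficient for one structure is $2|P \cap T| / (|P| + |T|)$, where $P$ and $T$ are the predicted and reference voxel sets. The following is a minimal sketch in plain Python over flattened label sequences, not the authors' evaluation code. The function names `dice_coefficient` and `mean_dice`, the toy data, and the choice to return 1.0 when a structure is absent from both volumes are illustrative assumptions; this section does not state how the paper averages Dice across cardiac substructures.

```python
from collections.abc import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int], label: int) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for one label over flattened voxel labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of voxels")
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if pred_count + target_count == 0:
        # Assumed convention: a structure absent from both volumes counts as a match.
        return 1.0
    return 2.0 * overlap / (pred_count + target_count)


def mean_dice(pred: Sequence[int], target: Sequence[int], labels: Sequence[int]) -> float:
    """Unweighted mean of per-label Dice (hypothetical aggregation for illustration)."""
    return sum(dice_coefficient(pred, target, lab) for lab in labels) / len(labels)


# Toy example: 8 voxels with labels 0 (background), 1 and 2.
pred = [0, 1, 1, 1, 2, 2, 0, 0]
target = [0, 1, 1, 0, 2, 2, 2, 0]

dice_1 = dice_coefficient(pred, target, label=1)  # overlap 2, sizes 3 and 2
dice_2 = dice_coefficient(pred, target, label=2)  # overlap 2, sizes 2 and 3
avg = mean_dice(pred, target, labels=[1, 2])
```

In practice the same computation would run on 3D label volumes with an array library; the flat-list form here only keeps the arithmetic visible.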

Paper Structure

This paper contains 34 sections, 20 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Paradigm comparisons between the existing multimodal medical image segmentation methods and our method
  • Figure 2: Overall framework of PPO training in RL-XAlign module. The framework demonstrates how CMA establishes semantic correspondences while PPO's Actor-Critic architecture learns optimal spatial alignment strategies through iterative state-action optimization for cross-modal feature fusion.
  • Figure 3: Schematic of the decision-making module based on ensemble learning.
  • Figure 4: Visualization of methods comparison on MM-WHS 2017 Dataset.
  • Figure 5: Ablation experiments of RL-XAlign and AGWD. The top rows show RL-XAlign reward curves while bottom rows compare loss dynamics with/without AGWD.
  • ...and 4 more figures