MRNet: Multifaceted Resilient Networks for Medical Image-to-Image Translation

Hyojeong Lee, Youngwan Jo, Inpyo Hong, Sanghyun Park

TL;DR

MRNet tackles the challenge of cross-modality medical image translation by integrating SAM-based frequency-aware features into a U-Net-like architecture and coupling it with a dual-mask correction mechanism and a multi-term loss. The method leverages a Hierarchical SAM-Based Encoder to fuse SAM features at multiple resolutions and employs a multimask framework to refine feature selection, achieving state-of-the-art results on MRI-to-CT and MRI-to-MRI tasks. Quantitative gains are demonstrated across PSNR and SSIM, with MRNet outperforming strong baselines such as Pix2Pix, CycleGAN, TransUNet, Swin-T, ResViT, and PPT, while ablation confirms the importance of mask count and mask-based loss. The work offers practical implications for improved anatomical fidelity in clinical translation pipelines and suggests directions for efficiency and plane-aware extensions with clinical validation metrics.

Abstract

We propose a Multifaceted Resilient Network (MRNet), a novel architecture developed for medical image-to-image translation that outperforms state-of-the-art methods in MRI-to-CT and MRI-to-MRI conversion. MRNet leverages the Segment Anything Model (SAM) to exploit frequency-based features to build a powerful method for advanced medical image transformation. The architecture extracts comprehensive multiscale features from diverse datasets using a powerful SAM image encoder and performs resolution-aware feature fusion that consistently integrates U-Net encoder outputs with SAM-derived features. This fusion optimizes the traditional U-Net skip connection while leveraging transformer-based contextual analysis. The translation is complemented by an innovative dual-mask configuration incorporating dynamic attention patterns and a specialized loss function designed to address regional mapping mismatches, preserving both the gross anatomy and tissue details. Extensive validation studies have shown that MRNet outperforms state-of-the-art architectures, particularly in maintaining anatomical fidelity and minimizing translation artifacts.

Paper Structure

This paper contains 19 sections, 11 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illustration of generator G shows how the encoder network E processes a sampled feature map, which is then concatenated with the encoder features. The S-clamp facilitates a smooth transition of the feature at the lowest level between 0 and 1. Finally, the last stage of the decoder operates through the mask module, which is a multistage convolution-based component.
  • Figure 2: (Left) Original image; (Middle) Image from the first layer of the encoder; (Right) Sum of the first layer of the encoder and the SAM feature received by the first layer of the decoder.
  • Figure 3: This visualization presents the results for the z-axis of the image, which has been translated into the CT modality using the MRI input. MRNet displayed the highest performance based on individual PSNR values. Unlike other methods, which resulted in a jittery appearance or inaccurately rendered bone structures in areas where they should not appear, our approach demonstrated a clean and stable translation. The top row displays the translation results, whereas the middle row offers an enlarged view of the area highlighted in the yellow box. The error map at the bottom clearly indicates that our method, which exhibited the least activation, produced the most stable outcome.
  • Figure 4: This visualization presents the results for the y-axis of the image, which has been translated into the CT modality using the MRI input. The top row depicts the translation results. The middle row provides an enlarged view of the area highlighted within the yellow box. The differences from the ground truth are illustrated in the error map shown at the bottom. MRNet achieved the second-highest individual PSNR value for this visualization. However, the visualizations show that its output is closer to the ground truth than that of the top-scoring method. The transformer-only method exhibited grid-like and mottled features.
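
The captions for Figures 3 and 4 rank methods by per-image PSNR and compare error maps against the ground truth. As a rough illustration of what those two quantities measure, the sketch below computes them for small 2D images stored as nested lists. The function names `psnr` and `error_map`, the `data_range` parameter, and the assumption that intensities are scaled to [0, 1] are illustrative choices and are not taken from the paper; the paper's exact normalization and evaluation protocol may differ.

```python
import math


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized 2D images.

    data_range is the assumed maximum possible intensity span (hypothetical
    default of 1.0, i.e. images normalized to [0, 1]).
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("images must have the same shape")
    n_pixels = sum(len(row) for row in target)
    # Mean squared error over all pixels.
    mse = sum(
        (p - t) ** 2
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    ) / n_pixels
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def error_map(pred, target):
    """Per-pixel absolute difference, the kind of map shown in a bottom row."""
    return [
        [abs(p - t) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(pred, target)
    ]


# Toy example with made-up values (not data from the paper).
target = [[0.0, 0.5], [1.0, 0.25]]
pred = [[0.1, 0.5], [0.9, 0.25]]
print(f"PSNR: {psnr(pred, target):.2f} dB")
print(error_map(pred, target))
```

A higher PSNR means a lower mean squared error over the whole image, which is why a method can score slightly lower on PSNR yet still look closer to the ground truth in a specific region, as the Figure 4 caption notes.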