MTVNet: Mapping using Transformers for Volumes -- Network for Super-Resolution with Long-Range Interactions

August Leander Høeg, Sophia W. Bardenfleth, Hans Martin Kjer, Tim B. Dyrby, Vedrana Andersen Dahl, Anders Dahl

TL;DR

MTVNet addresses the challenge of applying transformers to 3D volumetric SR by introducing a three-level, multi-contextual architecture with carrier tokens and a shifting hierarchical attention mechanism (SVHAT). The model uses coarse-to-fine feature extraction and cross-scale fusion via cross-attention to expand the effective receptive field while keeping memory usage practical. On high-resolution volumetric data (FACTS), MTVNet achieves state-of-the-art performance, with ablations confirming gains from multi-context and CAT-based cross-scale interactions; on brain MRI datasets, it remains competitive, highlighting data-domain dependencies. The work demonstrates that multi-contextual transformer designs can unlock long-range dependencies in 3D SR and may generalize to other volumetric vision tasks such as segmentation and classification.

Abstract

Until now, it has been difficult for volumetric super-resolution to utilize the recent advances in transformer-based models seen in 2D super-resolution. The memory required for self-attention in 3D volumes limits the receptive field. Therefore, long-range interactions are not used in 3D to the extent done in 2D and the strength of transformers is not realized. We propose a multi-scale transformer-based model based on hierarchical attention blocks combined with carrier tokens at multiple scales to overcome this. Here information from larger regions at coarse resolution is sequentially carried on to finer-resolution regions to predict the super-resolved image. Using transformer layers at each resolution, our coarse-to-fine modeling limits the number of tokens at each scale and enables attention over larger regions than what has previously been possible. We experimentally compare our method, MTVNet, against state-of-the-art volumetric super-resolution models on five 3D datasets demonstrating the advantage of an increased receptive field. This advantage is especially pronounced for images that are larger than what is seen in popularly used 3D datasets. Our code is available at https://github.com/AugustHoeg/MTVNet
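
The memory argument in the abstract can be made concrete with a little arithmetic. The sketch below is not taken from the MTVNet code. It assumes a cubic input, one token per cubic patch, a single attention head, and 4-byte values; the function name `attention_map_bytes` and the patch sizes are illustrative choices only. It counts the entries of one full self-attention map, which grows with the square of the token count.

```python
def attention_map_bytes(side, dims, patch=1, bytes_per_value=4):
    """Return (token count, bytes of one full attention map).

    side: voxels (or pixels) per edge of the input
    dims: 2 for an image, 3 for a volume
    patch: edge length of one token's patch (hypothetical values below)
    """
    tokens = (side // patch) ** dims
    # Full self-attention compares every token with every other token.
    return tokens, tokens ** 2 * bytes_per_value


GIB = 2 ** 30
MIB = 2 ** 20

# A 64 x 64 image with one token per pixel: 4096 tokens, 64 MiB map.
print(attention_map_bytes(64, dims=2)[1] / MIB)

# A 64^3 volume with one token per voxel: 262144 tokens, 256 GiB map.
print(attention_map_bytes(64, dims=3)[1] / GIB)

# A larger 128^3 region tokenized coarsely (patch edge 8, an assumed
# value): again 4096 tokens, so the map is back to 64 MiB.
print(attention_map_bytes(128, dims=3, patch=8)[1] / MIB)
```

The last case illustrates why coarse tokenization of a large context keeps the token count manageable, which is the idea behind the coarse-to-fine design described above.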

Paper Structure

This paper contains 20 sections, 5 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of MTVNet, which uses a large contextual volume processed at multiple resolution scales to predict the SR output for the center volume.
  • Figure 2: Illustration of MTVNet and the structure of the DCHAT Block and DCHAT Group. Our proposed architecture consists of up to three levels of multi-contextual volumetric image processing. The first two levels perform tokenization using larger 3D patch sizes to cover broader contextual regions, while succeeding levels process subsets of the input volume using smaller patch sizes, resulting in both coarse- and fine-grained feature extraction. The depth of subsequent DCHAT Groups increases from $n = 1$ to $3$ DCHAT Blocks towards the last stage. The token embeddings from preceding network levels are fused into later levels using cross-attention.
  • Figure 3: Illustration of the volumetric attention mechanisms used in SVHAT: full CAT attention, W-MSA with CAT, and SW-MSA with CAT. Our proposed SVHAT uses alternating shifted and non-shifted windowed attention. Masking is used to limit information exchange between non-adjacent ITEs and CATs. In these examples, the window size is $M = 4$ and the CAT space size is $c=2$.
  • Figure 4: Visual comparisons of SR model outputs from the datasets HCP 1200, IXI, FACTS-Synth, and FACTS-Real using $4\times$ upscaling. The ground truth (GT) and LR input images are shown side-by-side in the top-left separated by the red line.
  • Figure 5: GPU memory usage of SuperFormer, RRDBNet3D, and MTVNet using a single 3D patch at resolutions $16^3$, $32^3$, $48^3$, and $64^3$. Adding contextual levels to MTVNet enables increasing resolution to $128^3$ and beyond without exceeding GPU memory.
  • ...and 3 more figures
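
The alternation between non-shifted and shifted windows mentioned in the Figure 3 caption can be sketched with plain index arithmetic. This is a minimal illustration, not the MTVNet implementation: the helper `window_index` is hypothetical, the token grid is assumed cubic, and the shift of $M/2$ tokens follows the common Swin-style convention rather than anything stated on this page. Carrier tokens (CATs) are left out.

```python
def window_index(coord, size, window, shift=0):
    """Map a token coordinate (z, y, x) to the index of its attention window.

    size: tokens per edge of the (assumed cubic) token grid
    window: window edge length M in tokens
    shift: cyclic shift in tokens (0 for W-MSA, assumed M // 2 for SW-MSA)
    """
    per_edge = size // window
    idx = 0
    for c in coord:
        # Cyclically shift the coordinate, then find its window along this axis.
        idx = idx * per_edge + ((c - shift) % size) // window
    return idx


M = 4  # window size used in the Figure 3 example

# Two neighbouring tokens on opposite sides of a window border:
a, b = (3, 3, 3), (4, 4, 4)
print(window_index(a, 8, M), window_index(b, 8, M))          # different windows
print(window_index(a, 8, M, M // 2), window_index(b, 8, M, M // 2))  # same window

# The cyclic shift also groups tokens from opposite corners of the grid,
# which are not adjacent in the volume.
print(window_index((0, 0, 0), 8, M, M // 2) == window_index((7, 7, 7), 8, M, M // 2))
```

The last line shows the wrap-around effect of a cyclic shift: distant tokens can land in the same window, which is the kind of pairing that attention masking is typically used to suppress.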