
Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)

Gustavo Chau Loo Kung, Mohammad Abbasi, Camila Blank, Juze Zhang, Alan Q. Wang, Sophie Ostmeier, Akshay Chaudhari, Kilian Pohl, Ehsan Adeli

Abstract

Diffusion Magnetic Resonance Imaging (dMRI) plays a critical role in studying microstructural changes in the brain. It is therefore widely used in clinical practice, yet progress in learning general-purpose representations from dMRI has been limited. A key challenge is that existing deep learning approaches are not well suited to capturing the unique properties of diffusion signals. Brain dMRI typically comprises several brain volumes, each with different attenuation characteristics depending on the direction and strength of the diffusion-sensitizing gradients. There is thus a need to jointly model spatial, diffusion-weighting, and directional dependencies in dMRI. Furthermore, varying acquisition protocols (e.g., differing numbers of directions) further limit traditional models. To address these gaps, we introduce a diffusion space rotary positional embedding (D-RoPE), plugged into our dMRI transformer, that captures both the spatial structure and directional characteristics of diffusion data, enabling robust and transferable representations across diverse acquisition settings and an arbitrary number of diffusion directions. After self-supervised masked-autoencoding pretraining, the learned representations and the pretrained model provide competitive or superior performance on several downstream tasks compared to several baselines (including a fully trained baseline); features finetuned from our pretrained encoder yielded a 6% higher accuracy in classifying mild cognitive impairment and a 0.05 increase in the correlation coefficient when predicting cognitive scores. Code is available at: github.com/gustavochau/D-RoPE.

Paper Structure

This paper contains 16 sections, 5 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of dMRI data structure and model input. The purple sphere represents diffusion sampling directions at a specific b-value. 3D dMRI volumes acquired at different directions are divided into 3D patches. Spatial coordinates $(x,y,z)$ define the image domain and spherical coordinates $(\rho,\theta,\varphi)$ describe the diffusion space. Patches are linearized with absolute positional embeddings and processed by a transformer encoder incorporating relative positional encoding (D-RoPE; see Section \ref{sec:drope} and Fig. \ref{fig:drope}).
  • Figure 2: Summary of the model architecture. The different diffusion volumes are patchified via a projection layer, and absolute positional information from the image space and the diffusion space is added to the patch embeddings. The tokens are passed through modified attention blocks that alternate attention between the image and diffusion spaces. Additionally, relative information about different diffusion directions and b-values is encoded via D-RoPE. The latent representations obtained from the encoder are used for downstream task evaluation. To close the pretraining loop, some tokens are masked and fed into transformer blocks identical to those used in the encoder, followed by 3D convolutional blocks that reconstruct the original volumes.
  • Figure 3: Attention calculation with D-RoPE: The relative rotation matrix depends on the distance between diffusion volumes, which is defined as a weighted combination of the distance between b-values and the angle between b-vectors.
  • Figure 4: Qualitative evaluation of the reconstructions for a test subject at a b-value of 2000 $s/mm^2$. The reconstructions obtained with different masking strategies (Spatial, Diffusion, and Alternating), with and without D-RoPE, are shown in the three left-most columns. The reference ground-truth image and the spatial mask that was used are shown in the two right-most columns (white = masked). In general, a good reconstruction of the brain structures is observed. The MAE without D-RoPE shows an erroneous contrast when only diffusion space masking is performed, possibly because of the lack of spatial information for a given diffusion direction.
  • Figure 5: Qualitative evaluation of the reconstructions for a test subject at a b-value of 1000 $s/mm^2$. The reconstructions obtained with different masking strategies (Spatial, Diffusion, and Alternating), with and without D-RoPE, are shown in the three left-most columns. The reference ground-truth image and the spatial mask that was used are shown in the two right-most columns.
  • ...and 3 more figures
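Figure 2's caption describes attention blocks that alternate between the image and diffusion spaces. A rough illustration of that factorized pattern is attending along one axis of the token grid at a time; this is a minimal sketch under stated assumptions (single head, no learned projections, no residuals or normalization), not the paper's implementation, and all names here are hypothetical:

```python
import numpy as np

def axis_attention(x, axis):
    """Self-attention applied along one axis of a token grid.
    x: array of shape (n_spatial, n_diffusion, dim).
    axis=0 attends across spatial patches (within each diffusion volume);
    axis=1 attends across diffusion volumes (within each spatial location).
    Q/K/V projections are omitted for brevity (identity projections)."""
    xt = np.moveaxis(x, axis, -2)                       # (..., L, dim)
    scores = xt @ xt.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # softmax along the chosen axis
    return np.moveaxis(w @ xt, -2, axis)                # restore original layout

# Alternate spatial and diffusion attention, as in the described blocks.
tokens = np.random.randn(64, 12, 32)   # 64 patches x 12 diffusion volumes x dim 32
out = axis_attention(axis_attention(tokens, axis=0), axis=1)
```

Factorizing attention this way keeps the cost at roughly O(n_spatial² + n_diffusion²) per location instead of O((n_spatial·n_diffusion)²) for full joint attention over all tokens.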
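Figure 3's caption describes D-RoPE's relative rotation as a function of a diffusion-space distance, defined as a weighted combination of the b-value gap and the angle between b-vectors. The sketch below illustrates that idea with a standard RoPE-style rotation; the weights `alpha`/`beta`, the exact distance form, and all function names are illustrative assumptions, not the paper's specification:

```python
import numpy as np

def diffusion_distance(b_i, b_j, g_i, g_j, alpha=1.0, beta=1.0):
    """Scalar distance between two diffusion volumes: a weighted combination
    of the b-value gap and the angle between (unit-normalized) b-vectors.
    abs() on the dot product reflects the antipodal symmetry of dMRI gradients."""
    g_i = g_i / np.linalg.norm(g_i)
    g_j = g_j / np.linalg.norm(g_j)
    angle = np.arccos(np.clip(abs(np.dot(g_i, g_j)), -1.0, 1.0))
    return alpha * abs(b_i - b_j) + beta * angle

def rope_rotate(x, pos, base=10000.0):
    """Standard rotary embedding: rotate feature pairs by angles pos / base^(2k/d)."""
    d = x.shape[-1]
    half = d // 2
    freqs = pos / (base ** (np.arange(half) * 2.0 / d))
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Rotating queries and keys by their diffusion-space "positions" makes the
# attention score depend only on the relative distance p_i - p_j:
q, k = np.random.randn(8), np.random.randn(8)
p_i = diffusion_distance(1000.0, 0.0, np.array([1.0, 0, 0]), np.array([1.0, 0, 0]))
p_j = diffusion_distance(2000.0, 0.0, np.array([0, 1.0, 0]), np.array([1.0, 0, 0]))
score = rope_rotate(q, p_i) @ rope_rotate(k, p_j)
shifted = rope_rotate(q, p_i + 5.0) @ rope_rotate(k, p_j + 5.0)
assert np.allclose(score, shifted)  # invariant to a common shift of both positions
```

The shift-invariance check is the defining property of rotary embeddings: the inner product of two rotated vectors depends only on the difference of their positions, which is what lets a scalar diffusion-space distance act as a relative positional signal.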