Continuous Spatio-Temporal Memory Networks for 4D Cardiac Cine MRI Segmentation

Meng Ye, Bingyu Xin, Leon Axel, Dimitris Metaxas

TL;DR

Results of extensive experiments across multiple cMR datasets show that the proposed CSTM network can improve the 4D cMR segmentation performance, especially for the hard-to-segment regions.

Abstract

Current cardiac cine magnetic resonance image (cMR) studies focus on the end diastole (ED) and end systole (ES) phases and ignore the abundant temporal information in the rest of the image sequence, because whole-sequence segmentation is currently tedious and inaccurate. Conventional whole-sequence segmentation approaches first estimate the motion field between frames and then use it to propagate the mask along the temporal axis. However, the propagated masks are prone to error, especially for the basal and apex slices, where through-plane motion causes significant morphological and structural change during the cardiac cycle. Inspired by recent advances in video object segmentation (VOS) based on spatio-temporal memory (STM) networks, we propose a continuous STM (CSTM) network for semi-supervised whole-heart and whole-sequence cMR segmentation. Our CSTM network takes full advantage of the spatial, scale, temporal and through-plane continuity priors of the underlying heart anatomy to achieve accurate and fast 4D segmentation. Results of extensive experiments across multiple cMR datasets show that our method can improve 4D cMR segmentation performance, especially for hard-to-segment regions.

Paper Structure

This paper contains 17 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Cardiac cine magnetic resonance (cMR) images. (a) and (d) show the long-axis 4-chamber views; (b) and (c) show the short-axis views at or near the (1) basal, (2) middle and (3) apex regions. (a) and (b) show the images at the end diastole (ED) phase, while (c) and (d) show the images at the end systole (ES) phase. Through-plane motion of the heart causes in-plane structural change, which can be observed in (b) and (c), especially for the basal and apex slices. The red boxes in (b) and (c) mark the area in and around the heart ventricles. Note that slice position 1 corresponds to the left ventricle (LV) and right ventricle (RV) at ED, but to the left atrium (LA) and right atrium (RA) at ES; slice position 3 intersects the LV apex at ED, but not at ES.
  • Figure 2: Architecture of CSTM. Both the key and value encoders are based on ResNet. The key encoder takes a query/memory frame as input and outputs multi-scale query/memory key features. The value encoder takes a memory frame together with its segmentation mask as input and outputs multi-scale value features. We perform patch-level memory matching (PLMM) at scales 3 and 4 to read out memory values, which are fed into the decoder to output the segmentation mask of the query frame.
  • Figure 3: An illustration of patch-level memory matching (PLMM). We first divide an image or feature map into patches, then we match each query patch in (a) with top-K memory patches in (b) and (c). Finally, we perform dense pixel-level matching between the query patch and the top-K memory patches. In (d), we show the pixel-level affinity map of the query pixel in (a) with memory 2 in (c). In (e), we show the patch-level affinity map of the query patch in (a) with memory 2 in (c). PLMM can efficiently filter out noisy matches, e.g., the dashed line between the left ventricle area (purple) and the stomach area (gray), by leveraging the local spatial continuity prior in an image.
  • Figure 4: Multi-scale memory matching by leveraging the scale continuity prior in a feature pyramid. The feature map size at scale $s$ is 1/4 of that at scale $s-1$. We set the patch size at scale $s-1$ as 4$\times$ of that at scale $s$. After performing patch matching at scale $s$, we directly copy the $\mathit{topk\_id}$ to scale $s-1$. The scale continuity prior can ensure the patch matching accuracy at the coarser scale $s-1$ by passing the matching results across scales.
  • Figure 5: An illustration of the inference strategy of CSTM. The annotated frame is shown in the purple box ($t_{z0}=0$), in which red is the LV, green is the myocardium wall (Myo), and blue is the RV. We propagate the mask first along the temporal $t$-axis (blue line); then along the $z$-axis (green and yellow lines). For each query frame $t_{z}=\tau$ (red box) in the basal or middle region, we use the memory at $t_{z0}=0$ and $t_{z-1}=\tau$ for spatio-temporal matching. For each query frame $t_{z}=\tau$ in the apex region, we use the memory at $t_{z0}=0$, $t_{z+1}=\tau$ and $t_{z}=\tau-1$, for spatio-temporal matching.
  • ...and 1 more figures
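
The patch-then-pixel matching described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `to_patches` and `plmm_readout`, the mean-pooled patch descriptor, the dot-product affinity, the softmax readout, and the default `patch` and `top_k` values are all hypothetical choices made for the example.

```python
import numpy as np


def to_patches(feat, p):
    """Split a (C, H, W) map into non-overlapping p x p patches: (N, p*p, C)."""
    c, h, w = feat.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by the patch size"
    x = feat.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)  # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p, c)


def plmm_readout(q_key, m_key, m_val, patch=4, top_k=2):
    """Sketch of patch-level memory matching.

    q_key: (C, H, W) query key features
    m_key: (T, C, H, W) memory key features for T memory frames
    m_val: (T, Cv, H, W) memory value features
    Returns the read-out values (Cv, H, W) and the top-K patch ids (Nq, K).
    """
    c, h, w = q_key.shape
    qp = to_patches(q_key, patch)                                # (Nq, P, C)
    mp = np.concatenate([to_patches(k, patch) for k in m_key])   # (Nm, P, C)
    vp = np.concatenate([to_patches(v, patch) for v in m_val])   # (Nm, P, Cv)
    nq, cv = qp.shape[0], vp.shape[-1]

    # Step 1: patch-level affinity. Assumption: each patch is summarized by
    # its mean feature vector and compared with a dot product.
    patch_aff = qp.mean(axis=1) @ mp.mean(axis=1).T              # (Nq, Nm)
    topk_id = np.argsort(-patch_aff, axis=1)[:, :top_k]          # (Nq, K)

    # Step 2: dense pixel-level matching, restricted to the selected patches.
    sel_k = mp[topk_id].reshape(nq, -1, c)                       # (Nq, K*P, C)
    sel_v = vp[topk_id].reshape(nq, -1, cv)                      # (Nq, K*P, Cv)
    logits = qp @ sel_k.transpose(0, 2, 1) / np.sqrt(c)          # (Nq, P, K*P)
    logits -= logits.max(axis=-1, keepdims=True)                 # stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ sel_v                                        # (Nq, P, Cv)

    # Fold the patches back into a (Cv, H, W) map.
    out = out.reshape(h // patch, w // patch, patch, patch, cv)
    out = out.transpose(4, 0, 2, 1, 3).reshape(cv, h, w)
    return out, topk_id
```

Because the softmax runs only over pixels in the top-K memory patches, a query pixel cannot attend to a region that fails to match at patch level, which is the filtering effect the caption attributes to the local spatial continuity prior.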
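
The index copy in the Figure 4 caption relies on both scales having the same patch grid. As a short worked check, assuming the two ratios in the caption are measured in the same way (the symbols $A_s$ for feature map size, $a_s$ for patch size and $N_s$ for the number of patches are introduced here for illustration only):

$$
N_{s-1} \;=\; \frac{A_{s-1}}{a_{s-1}} \;=\; \frac{4\,A_s}{4\,a_s} \;=\; \frac{A_s}{a_s} \;=\; N_s
$$

Under that assumption, a patch index selected at scale $s$ refers to the same spatial region at scale $s-1$, so the top-K indices can be reused without running the patch matching again.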
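
The memory selection rule in the Figure 5 caption can be written as a small lookup. The sketch below is illustrative only: the function name `select_memory`, the `(z, t)` tuple encoding and the `region` argument are hypothetical, and two cases the caption does not spell out are handled by assumption (frames on the annotated slice use only the annotated frame; the $\tau-1$ memory is dropped when $\tau=0$).

```python
from typing import List, Tuple

Frame = Tuple[int, int]  # (slice index z, time index t)


def select_memory(z: int, tau: int, z0: int, region: str) -> List[Frame]:
    """Return the memory frames used to segment the query frame (z, tau).

    z0 is the annotated slice; its frame at t = 0 carries the given mask.
    region is "basal", "middle" or "apex" and is supplied by the caller
    (how a slice is assigned to a region is not covered here).
    """
    memory: List[Frame] = [(z0, 0)]  # the annotated frame is always used

    def add(frame: Frame) -> None:
        # Avoid listing the same frame twice.
        if frame not in memory:
            memory.append(frame)

    if z == z0:
        # Assumption: the temporal pass on the annotated slice relies on
        # the annotated frame alone.
        return memory

    if region in ("basal", "middle"):
        add((z - 1, tau))        # neighbouring slice at the same time
    elif region == "apex":
        add((z + 1, tau))        # neighbouring slice at the same time
        if tau > 0:              # assumption: no previous frame at tau = 0
            add((z, tau - 1))    # same slice, previous time step
    else:
        raise ValueError(f"unknown region: {region!r}")
    return memory
```

For example, `select_memory(7, 5, 2, "apex")` lists the annotated frame, the neighbouring slice at the same time and the previous frame of the same slice, matching the three memories named in the caption for the apex region.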