Deep Learning in Medical Image Registration: Magic or Mirage?

Rohit Jena, Deeksha Sethi, Pratik Chaudhari, James C. Gee

TL;DR

This paper makes an explicit correspondence between the mutual information of the distribution of per-pixel intensity and labels, and the performance of classical registration methods, and proposes a general recipe to choose the best paradigm for a given registration problem.

Abstract

Classical optimization and learning-based methods are the two reigning paradigms in deformable image registration. While optimization-based methods boast generalizability across modalities and robust performance, learning-based methods promise peak performance, incorporating weak supervision and amortized optimization. However, the exact conditions under which either paradigm outperforms the other are obscure and not explicitly outlined in the existing literature. In this paper, we make an explicit correspondence between the mutual information of the distribution of per-pixel intensity and labels, and the performance of classical registration methods. This strong correlation hints at the fact that architectural designs in learning-based methods are unlikely to affect this correlation, and therefore the performance of learning-based methods. This hypothesis is thoroughly validated with state-of-the-art classical and learning-based methods. However, learning-based methods with weak supervision can perform high-fidelity intensity and label registration, which is not possible with classical methods. Next, we show that this high-fidelity feature learning does not translate to invariance to domain shift, and that learning-based methods are sensitive to such changes in the data distribution. Finally, based on these observations, we propose a general recipe to choose the best paradigm for a given registration problem.
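
To make the quantity in the abstract concrete, the following is a minimal sketch of how the mutual information between per-pixel intensity and labels could be estimated from a joint histogram. It is not the authors' implementation: the function name `mutual_information`, the equal-width binning, and the default of 32 bins are assumptions chosen for illustration, and the inputs are assumed to be flattened lists of paired per-pixel values.

```python
import math
from collections import Counter


def mutual_information(intensities, labels, num_bins=32):
    """Estimate I(intensity; label) in nats from paired per-pixel samples.

    Assumption: intensities are discretized into `num_bins` equal-width bins;
    the paper may use a different estimator.
    """
    if len(intensities) != len(labels):
        raise ValueError("intensities and labels must have the same length")
    n = len(intensities)
    if n == 0:
        raise ValueError("at least one sample is required")

    lo, hi = min(intensities), max(intensities)
    # Fall back to a unit width when the image is constant.
    width = (hi - lo) / num_bins or 1.0

    def bin_of(value):
        # Clamp so that the maximum intensity falls in the last bin.
        return min(int((value - lo) / width), num_bins - 1)

    binned = [bin_of(v) for v in intensities]
    joint = Counter(zip(binned, labels))   # counts for p(intensity_bin, label)
    count_bin = Counter(binned)            # counts for p(intensity_bin)
    count_label = Counter(labels)          # counts for p(label)

    mi = 0.0
    for (b, lab), c in joint.items():
        # p(b, lab) * log( p(b, lab) / (p(b) * p(lab)) )
        mi += (c / n) * math.log(c * n / (count_bin[b] * count_label[lab]))
    return mi
```

When intensity fully determines the label, the estimate equals the label entropy; when the two are independent, it is zero. Under the paper's claim, image-label pairs with a higher value would be expected to yield higher Dice scores for classical methods.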

Paper Structure (15 sections, 4 equations, 5 figures)

Figures (5)

  • Figure 1: Correlation between Dice Score and Mutual Information. Classical registration methods like ANTs show a strong correlation between the Dice Score of registered pairs, and the mutual information between the corresponding image and label across 4 brain datasets.
  • Figure 2: Performance of classical and unsupervised DLIR methods on OASIS data. Boxplots (top) show that classical methods are, on average, ranked higher than DLIR methods on both the trainval and val splits. Interestingly, the performance of unsupervised DLIR methods does not improve on the trainval split compared to the val split, showing that deep learning does not have an intrinsic advantage in label alignment. Tables (bottom) of p-values show the results of pairwise two-sided t-tests between the performance of classical and DLIR methods on the trainval and val splits. Each cell is marked as one of three outcomes: the classical method is significantly better than the DLIR method ($p < 0.01$), the opposite, or no significant difference. Most cells fall in the first category, indicating that classical methods are significantly better than DLIR methods.
  • Figure 3: Instrumentation bias in evaluation of image registration algorithms. We highlight a significant difference between the evaluation metrics reported by baselines and our evaluation on the OASIS validation dataset. This difference can be attributed to deviation of hyperparameters from the recommended values or to early stopping to save time. In either case, this misrepresentation leads to incorrect conclusions about the performance of the algorithm. The reported Dice scores are anywhere from 2 to 10 Dice points lower than our evaluation, showing a non-trivial instrumentation bias. We report our own evaluation of DLIR algorithms and compare it with reported values to avoid introducing instrumentation bias in our evaluation.
  • Figure 4: Performance of classical and supervised DLIR methods on OASIS data. Boxplots (top) show that DLIR methods achieve superior performance compared to classical methods. Unlike the unsupervised case, the effect of overfitting is clearly visible in the gap between the trainval and val splits. Tables (bottom) of p-values show the results of pairwise two-sided t-tests between the performance of classical and DLIR methods on the trainval and val splits. Each cell is marked as one of three outcomes: the classical method is significantly better than the DLIR method ($p < 0.01$), the opposite, or no significant difference. State-of-the-art DLIR methods show significantly better performance than classical methods when label supervision is added.
  • Figure 5: Classical methods retain robustness across different datasets. Boxplots show the performance of classical and DLIR methods trained on the OASIS dataset, evaluated on four T1-brain datasets. For DLIR methods, we plot the performance of both the supervised and unsupervised models. Across all datasets, FireANTs and ANTs consistently outperform DLIR methods, showing robustness to domain shift. Among DLIR methods, SynthMorph and TransMorph show robust performance, and training with a label-matching objective does not lead to significant improvement.
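
The figures above all report the Dice score between label maps of registered pairs. As a reference for how that metric is commonly computed, here is a minimal sketch using only the Python standard library. The function names `dice_score` and `mean_dice`, the flattened-list input format, and the convention of returning 1.0 when a label is absent from both maps are assumptions for illustration; the paper's evaluation code may differ.

```python
def dice_score(seg_a, seg_b, label):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) for one label in two flattened label maps."""
    if len(seg_a) != len(seg_b):
        raise ValueError("label maps must have the same number of pixels")
    size_a = sum(1 for x in seg_a if x == label)
    size_b = sum(1 for y in seg_b if y == label)
    overlap = sum(1 for x, y in zip(seg_a, seg_b) if x == label and y == label)
    if size_a + size_b == 0:
        # Assumed convention: a label missing from both maps counts as perfect agreement.
        return 1.0
    return 2.0 * overlap / (size_a + size_b)


def mean_dice(seg_a, seg_b, labels):
    """Average the per-label Dice score over the given foreground labels."""
    scores = [dice_score(seg_a, seg_b, lab) for lab in labels]
    return sum(scores) / len(scores)


# Hypothetical toy example: a fixed label map and a warped moving label map.
fixed = [0, 1, 1, 2, 2, 2]
warped = [0, 1, 2, 2, 2, 1]
print(dice_score(fixed, warped, 1))      # 0.5
print(mean_dice(fixed, warped, [1, 2]))
```

In the toy example, label 1 overlaps on one of two pixels in each map, giving 2·1/(2+2) = 0.5, and label 2 overlaps on two of three pixels, giving 4/6.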