The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge

Rohit Jena, Pratik Chaudhari, James C. Gee

TL;DR

This work independently re-evaluates the LUMIR zero-shot claims for deformable image registration, contrasting deep learning (DL) methods with iterative optimization on in-distribution T1w data and in several out-of-distribution settings. It finds that DL methods match iterative performance on in-distribution T1w images and on macaque data, which the authors read as improved task understanding, but falter on unseen contrasts (T2, T2*, FLAIR), fail to run on 0.6 mm isotropic high-resolution data, and are sensitive to preprocessing that diverges from training conditions. The study emphasizes domain shift, labelmap dependencies, and preprocessing sensitivity as core limitations of zero-shot generalization, and advocates for evaluation protocols that reflect real-world clinical workflows. Overall, the results challenge claims of universal zero-shot superiority and underscore the continued relevance of robust, scalable iterative methods for high-resolution and multimodal neuroimaging tasks.

Abstract

The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d scores ranging from 0.7-1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
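The effect sizes quoted above are Cohen's d values. As a rough illustration of how such a number is obtained from per-subject Dice scores, here is a minimal sketch that assumes the standard independent-samples form with a pooled standard deviation; the paper may use a different variant (for example, a paired one). The function name `cohens_d` and the Dice values are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(scores_a, scores_b):
    """Cohen's d between two groups of scores, using the pooled sample
    standard deviation (assumed variant, not confirmed by the paper)."""
    n_a, n_b = len(scores_a), len(scores_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    return (mean(scores_a) - mean(scores_b)) / math.sqrt(pooled_var)


# Hypothetical per-subject Dice scores for two registration methods.
iterative_dice = [0.80, 0.82, 0.84]
deep_dice = [0.70, 0.72, 0.74]

d = cohens_d(iterative_dice, deep_dice)
print(f"Cohen's d = {d:.2f}")
```

By the usual convention, |d| near 0.2 is considered small, 0.5 medium, and 0.8 or more large, which is why a range of 0.7-1.5 is read as a practically meaningful gap.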

Paper Structure

This paper contains 13 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison of the three registration methods on the NIMH T1w dataset. Left shows violin plots of the Dice scores of the top iterative and deep learning registration methods on the NIMH T1w dataset. Right shows Cohen's d scores for all method pairs, quantifying the practical significance of the differences in Dice scores between the three registration methods.
  • Figure 2: Comparison of the three registration methods on the PRIME-DE dataset. Left shows violin plots of the Dice scores of tissue overlap (GM, WM, CSF). Right shows violin plots of the Dice scores of subcortical overlap between the registered and reference labelmaps.
  • Figure 3: Quantitative comparison of the three registration methods on the PRIME-DE dataset. Left shows the mean, median, and standard deviation of the Dice scores of the top three registration methods on the PRIME-DE dataset. Right shows Cohen's d scores for all method pairs.
  • Figure 4: Comparison of the three registration methods on out-of-distribution contrasts on the NIMH dataset with labels generated by SynthSeg.
  • Figure 5: Multimodal characterization of the Ultracortex dataset. Left shows axial slices of subjects from the Ultracortex dataset. Out of 12 subjects with labeled segmentations, 3 subjects have MP-RAGE sequence data, and 9 subjects have MP2RAGE sequence data. Right shows histograms of the intensity values of the subjects. The MP2RAGE sequences are characterized by two or three peaks close to the extreme values of the intensity range, while the MP-RAGE sequences have a more unimodal distribution with a single dominant peak. The qualitative differences in both the intensity values and histograms are indicative of the multimodal nature of the dataset.
  • ...and 2 more figures
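
Several of the captions above report Dice overlap between registered and reference labelmaps. The following is a minimal sketch of a per-label Dice computation, assuming both labelmaps are integer NumPy arrays of the same shape; the function name `dice_per_label` and the toy arrays are hypothetical and are not taken from the paper's evaluation code.

```python
import numpy as np


def dice_per_label(moved_labels, fixed_labels, labels):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) for each requested label.

    Returns NaN for a label that is absent from both labelmaps.
    """
    scores = {}
    for label in labels:
        a = moved_labels == label
        b = fixed_labels == label
        denom = int(a.sum()) + int(b.sum())
        if denom == 0:
            scores[label] = float("nan")
        else:
            scores[label] = 2.0 * int(np.logical_and(a, b).sum()) / denom
    return scores


# Toy 2D labelmaps standing in for a warped and a reference segmentation.
moved = np.array([[1, 1], [0, 2]])
fixed = np.array([[1, 0], [0, 2]])

print(dice_per_label(moved, fixed, labels=[1, 2]))
```

Per-label scores like these would typically be averaged over structures (for example GM, WM, CSF, or subcortical regions) for each subject before being summarized in violin plots.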