The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge

Rohit Jena; Pratik Chaudhari; James C. Gee

TL;DR

This work independently re-evaluates the LUMIR zero-shot claims for deformable image registration, contrasting deep learning (DL) methods with iterative optimization on in-distribution T1w data and in multiple out-of-distribution settings. It finds that DL methods match iterative performance on in-distribution T1w images and on a human-adjacent species (macaque), which suggests improved task understanding, but falter on unseen contrasts, fail to run on high-resolution data, and degrade when preprocessing diverges from training conditions. The study identifies domain shift, labelmap dependencies, and preprocessing sensitivity as core limitations of zero-shot generalization, and advocates for evaluation protocols that reflect real-world clinical workflows. Overall, the results challenge claims of universal zero-shot superiority and underscore the continued relevance of robust, scalable iterative methods for high-resolution and multimodal neuroimaging tasks.

Abstract

The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict the established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d scores ranging from 0.7 to 1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep learning methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
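
For readers unfamiliar with the effect size quoted above, the following is a minimal sketch of how a Cohen's d value can be computed from two sets of per-subject scores. It assumes the pooled-standard-deviation form for two independent groups; the abstract does not state which variant the authors used, and the function name `cohens_d` and the Dice-like values below are hypothetical illustrations, not data from the paper.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    # Sample variances (n - 1 in the denominator).
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    # Pooled standard deviation across both groups.
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    # Standardized difference of the group means.
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical per-subject overlap scores, for illustration only.
iterative_scores = [0.80, 0.82, 0.78, 0.81, 0.79]
deep_scores = [0.76, 0.79, 0.75, 0.78, 0.77]

d = cohens_d(iterative_scores, deep_scores)
print(f"Cohen's d = {d:.2f}")
```

By the usual convention, d near 0.2 is read as a small effect, 0.5 as medium, and 0.8 or above as large, so the reported range of 0.7 to 1.5 spans medium-to-large through very large differences.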
