CLAIRE: Scalable GPU-Accelerated Algorithms for Diffeomorphic Image Registration in 3D

Andreas Mang

CLAIRE: Scalable GPU-Accelerated Algorithms for Diffeomorphic Image Registration in 3D

Andreas Mang

TL;DR

CLAIRE advances scalable, GPU-accelerated diffeomorphic image registration by casting it as a PDE-constrained variational problem controlled by a stationary velocity field. It employs a Gauss–Newton–Krylov solver with semi-Lagrangian time integration, mixed-precision GPU kernels, and MPI for memory-distributed computation, enabling 3D brain registrations on large volumes in a few seconds on a single GPU. The framework leverages sophisticated preconditioning (regularization, two-level, and zero-velocity approximations) and parameter continuation to achieve robust convergence and diffeomorphic guarantees, demonstrated on the NIREP dataset with high Dice scores (~0.83–0.84). This work offers a practical route to real-time or near-real-time large-scale diffeomorphic registration suitable for clinical workflows and large imaging studies, while outlining limitations and directions for extending to non-stationary velocities, multi-modality similarity measures, and topology-changing scenarios.

Abstract

We present our work on scalable, GPU-accelerated algorithms for diffeomorphic image registration. The associated software package is termed CLAIRE. Image registration is a non-linear inverse problem. It is about computing a spatial mapping from one image of the same object or scene to another. In diffeomorphic image registration, the set of admissible spatial transformations is restricted to maps that are smooth, one-to-one, and have a smooth inverse. We formulate diffeomorphic image registration as a variational problem governed by transport equations. We use an inexact, globalized (Gauss--)Newton--Krylov method for numerical optimization. We consider semi-Lagrangian methods for numerical time integration. Our solver features mixed-precision, hardware-accelerated computational kernels for optimal computational throughput. We use the message-passing interface for distributed-memory parallelism and deploy our code on modern high-performance computing architectures. Our solver allows us to solve clinically relevant problems in under four seconds on a single GPU. It can also be applied to large-scale 3D imaging applications with data that is discretized on meshes with billions of voxels. We demonstrate that our numerical framework yields high-fidelity results in only a few seconds, even if we search for an optimal regularization parameter.

CLAIRE: Scalable GPU-Accelerated Algorithms for Diffeomorphic Image Registration in 3D

TL;DR

Abstract

Paper Structure (29 sections, 58 equations, 10 figures, 6 tables, 2 algorithms)

This paper contains 29 sections, 58 equations, 10 figures, 6 tables, 2 algorithms.

Introduction
Outline of the Method
Related Work
Contributions
Limitations
Outline
Methods
Mathematical Foundations
Variational Problem Formulation
Optimality Conditions
Newton Step
Numerics
Discretization
Time Integration
Differentiation
...and 14 more sections

Figures (10)

Figure 1: Image registration problem. On the left, we show a volume rendering of a 3D brain MRI. The figures in the middle show an axial slice of two MRI brain scans of different individuals. In image registration, we seek a map $\boldsymbol{y} \in \mathfrak{Y}_{\text{ad}} \subset \{\boldsymbol{\phi} : \mathbb{R}^d \to \mathbb{R}^d\}$, $d\in\{2,3\}$, that establishes a plausible spatial correspondence between these to images. In this work, we restrict the set of admissible spatial transformations $\mathfrak{Y}_{\text{ad}}$ to $\mathbb{R}^d$-diffeomorphisms. On the right, we show the residual differences between the axial slices of the images shown in the middle before (left) and after (right) diffeomorphic (deformable) registration. Here, white represents small residual differences, and black indicates large residual differences. We note that the registration of two brains from different individuals is a common application for diffeomorphic image registration in computational anatomy. However, strictly speaking this example is in violation with our underlying assumptions; we do not register the "same object"---we register images of brains of different individuals (in an attempt to study anatomical variability).
Figure 2: Illustration of the computation of the characteristic $\boldsymbol{y}$ in the SL scheme. In the SL scheme, we have to compute the departure points at time $t^j$. To do so, we start with a regular grid at time $t^{j+1}$ and solve for the characteristic $\boldsymbol{y}_{\boldsymbol{l}}$ at a given point $\boldsymbol{x}_{\boldsymbol{l}}$ at mesh index $\boldsymbol{l}$ backward in time (green line in the graphic on the left). The deformed grid configuration is overlaid onto the initial regular grid at time $t^j$. (Figure modified from Brunn:2020aMang:2017a.)
Figure 3: Domain decomposition for memory-distributed implementation. In each case, we assume that we use four MPI tasks to distribute our data (e.g., four GPUs or four nodes). Left: 3D volume rendering of medical imaging data set (brain image). Middle: Slab decomposition (1D domain decomposition) considered in our GPU implementation Brunn:2020a. We decompose the spatial domain in the outer-most dimension. We transpose the data only once. On the right we illustrate the pencil decomposition (2D domain decomposition) of the data considered in our CPU implementation Mang:2016bMang:2019aGholami:2017a. The computation in this data layout involves three transposes.
Figure 4: NIREP data repository Christensen:2006a. We show an axial slice of each dataset (slice number 128). The repository contains 16 rigidly aligned T1-weighted MRI brain datasets ( na01-- na16) of size $256\times300\times256$ voxels of different individuals. Each dataset is equipped with 32 labels of anatomical gray matter regions. We overlay these regions in different colors on the MRI data. We refer to Christensen:2006a for additional information about the datasets, the imaging protocol, and the preprocessing.
Figure 5: Convergence of the PCG method for solving for the Newton step. We solve for the search direction at the solution of the registration problem (dataset na02 registered to na01). We consider the squared $L^2$-distance. We solve this problem at the original resolution of the data. We consider three different preconditioners: The regularization preconditioner, the zero-velocity approximation of the reduced space Hessian, and the 2-level implementation of the zero-velocity approximation of the reduced space Hessian. We solve the problem for three different regularization parameter values (from left to right): $\alpha = 1e-1$, $\alpha = 1e-2$, and $\alpha = 1e-3$. The tolerance for the PCG method is $1e-6$. We plot the relative residual.
...and 5 more figures

CLAIRE: Scalable GPU-Accelerated Algorithms for Diffeomorphic Image Registration in 3D

TL;DR

Abstract

CLAIRE: Scalable GPU-Accelerated Algorithms for Diffeomorphic Image Registration in 3D

Authors

TL;DR

Abstract

Table of Contents

Figures (10)