Self-Supervised Learning for Interventional Image Analytics: Towards Robust Device Trackers

Saahil Islam; Venkatesh N. Murthy; Dominik Neumann; Badhan Kumar Das; Puneet Sharma; Andreas Maier; Dorin Comaniciu; Florin C. Ghesu

TL;DR

The paper tackles robust real-time device tracking in interventional X-ray sequences, a task hindered by occlusions, cardiac and respiratory motion, and changes in field of view or acquisition angle. It introduces the Frame Interpolation Masked Autoencoder (FIMAE), a self-supervised pretraining framework that learns spatio-temporal embeddings from a large unlabeled dataset of over 16 million frames using a frame-interpolation masking strategy. The pretrained features are transferred to a streamlined downstream tracker in which a Vision Transformer jointly handles feature extraction and matching. The tracker reaches a 97.95% success score at 42 frames per second on GPU (3x faster inference) and reduces maximum tracking error by 66.31% against optimized reference solutions (23.20% when flow regularization is used). The approach is robust across angiography, fluoroscopy, and device-occlusion scenarios, suggesting broad applicability to interventional image analytics and potential reductions in contrast usage and procedure time.

Abstract

Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, tracking must be highly robust, with no failures. Achieving this requires efficiently tackling several challenges: device obscuration by contrast agent or by other external devices or wires, changes in field-of-view or acquisition angle, and continuous movement due to cardiac and respiratory motion. To overcome these challenges, we propose a novel approach that learns spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance, and in particular robustness, compared to highly optimized reference solutions (that use multi-stage feature fusion, multi-task and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), with a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). The results encourage the use of our approach in other tasks within interventional image analytics that require an effective understanding of spatio-temporal semantics.
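The masking idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "frame interpolation" masking combines a tube mask (one random spatial pattern shared by all frames) with fully hidden intermediate frames that the model must reconstruct from their neighbors. The function name `tube_frame_mask`, the parameters `tube_ratio` and `num_masked_frames`, and their default values are hypothetical and chosen only for illustration.

```python
import numpy as np


def tube_frame_mask(num_frames, grid_h, grid_w,
                    tube_ratio=0.75, num_masked_frames=1, rng=None):
    """Build a boolean patch mask of shape (num_frames, grid_h, grid_w).

    True marks a patch hidden from the encoder. All names and defaults
    here are illustrative assumptions, not values taken from the paper.
    """
    if num_frames < 3:
        raise ValueError("need at least 3 frames to have an interior frame")
    if not 0 <= num_masked_frames <= num_frames - 2:
        raise ValueError("num_masked_frames must fit in the interior frames")
    rng = np.random.default_rng() if rng is None else rng

    # Tube masking: draw one spatial pattern and repeat it over time,
    # so a hidden patch location is hidden in every frame.
    num_patches = grid_h * grid_w
    num_hidden = int(round(tube_ratio * num_patches))
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[rng.choice(num_patches, size=num_hidden, replace=False)] = True
    mask = np.tile(spatial.reshape(1, grid_h, grid_w), (num_frames, 1, 1))

    # Frame masking: hide whole interior frames. The first and last frame
    # keep their visible patches, so reconstruction of the hidden frames
    # amounts to interpolating between visible neighbors.
    interior = np.arange(1, num_frames - 1)
    hidden_frames = rng.choice(interior, size=num_masked_frames, replace=False)
    mask[hidden_frames] = True
    return mask


if __name__ == "__main__":
    demo = tube_frame_mask(5, 4, 4, tube_ratio=0.5, num_masked_frames=2,
                           rng=np.random.default_rng(0))
    print(demo.shape, demo.mean())
```

Under these assumptions, the visible tokens come only from the partially masked frames, and the reconstruction target covers both the tube-masked patches and every patch of the hidden frames.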

Paper Structure (31 sections, 11 equations, 13 figures, 12 tables)

Introduction
Related Work
Self-Supervised Learning
Siamese Natural Image Tracking
Historical-Trajectory-based Natural Image Tracking
Device Tracking in X-Ray
Methods
Self-supervised Model Training
Learning space-time embeddings
Masking strategy based on frame interpolation
Encoder-Decoder Training
Pretraining Loss Function
Downstream Application: Device Tracking
Feature transfer
Multi-task Transformer Decoder
...and 16 more sections

Figures (13)

Figure 1: Tracking error ($\downarrow$) versus average speed ($\uparrow$) for catheter tip tracking in coronary X-ray sequences acquired during procedures such as invasive coronary angiography (ICA) or percutaneous coronary intervention (PCI): (a) showing average tracking error; and (b) showing maximum tracking error. Note that the average tracking error has $2$ different scales indicated with a horizontal break-point for better visualization. Runtime is measured on a Tesla V100 GPU.
Figure 2: Overview of key differences between our approach and previous approaches for device tracking.
Figure 3: Overview of our framework. First, the encoder is trained to learn spatio-temporal features from a large unlabeled dataset of angiography and fluoroscopy using the Frame Interpolation Masked Autoencoder (FIMAE) (left). Then, the weights are transferred into a ViT encoder that performs feature extraction and feature matching for tracking the catheter tip (right).
Figure 4: Schematic visualization of tube-frame masking.
Figure 5: Distribution of the datasets based on the Field of View (Positioner Primary angle and Positioner Secondary angle): The left plot denotes the unlabeled dataset ($\mathcal{D}_u$) and the right plot denotes the catheter tip dataset ($\mathcal{D}_l$).
...and 8 more figures
