Self-Supervised Learning for Interventional Image Analytics: Towards Robust Device Trackers
Saahil Islam, Venkatesh N. Murthy, Dominik Neumann, Badhan Kumar Das, Puneet Sharma, Andreas Maier, Dorin Comaniciu, Florin C. Ghesu
TL;DR
The paper tackles robust real-time device tracking in interventional X-ray sequences, a task hindered by occlusions, dynamic motion, and view changes. It introduces Frame Interpolation Masked Autoencoder (FIMAE), a self-supervised pretraining framework that learns spatio-temporal embeddings using a novel frame-interpolation masking strategy on a large unlabeled dataset. These pretrained features are transferred to a streamlined downstream tracker based on a Vision Transformer that jointly handles feature extraction and matching. The tracker achieves a 97.95% tracking success score at 42 frames per second, about 3x faster than the reference solutions, while reducing maximum tracking error by up to 66.31% compared with optimized baselines. The approach demonstrates strong robustness across angiography, fluoroscopy, and device-occlusion scenarios, suggesting broad applicability to interventional image analytics and potential reductions in contrast usage and procedure time.
Abstract
Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, tracking must be highly robust, with no failures. Achieving this requires efficiently handling several challenges: device obscuration by contrast agent or by other external devices or wires, changes in field-of-view or acquisition angle, and continuous movement due to cardiac and respiratory motion. To overcome these challenges, we propose a novel approach that learns spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance, and in particular robustness, compared to highly optimized reference solutions (that use multi-stage feature fusion, multi-task and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), while achieving a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
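To make the idea of frame-interpolation-based masking more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the masking schedule, so everything here is an assumption for illustration: the function name `frame_interpolation_mask`, the choice to hide every intermediate (odd-indexed) frame entirely, and the 75% random masking ratio on the remaining frames. The intent is only to show how a mask could force a model to reconstruct whole frames from their temporal neighbors.

```python
import random


def frame_interpolation_mask(num_frames, patches_per_frame,
                             visible_frame_ratio=0.75, seed=0):
    """Build per-frame patch masks (True = patch is hidden from the encoder).

    Illustrative assumptions, not taken from the paper:
    - odd-indexed frames are hidden entirely, so they must be
      reconstructed (interpolated) from neighboring frames;
    - even-indexed frames get random patch masking at
      `visible_frame_ratio` (a hypothetical parameter).
    """
    rng = random.Random(seed)
    num_hidden = int(visible_frame_ratio * patches_per_frame)
    masks = []
    for t in range(num_frames):
        if t % 2 == 1:
            # Intermediate frame: every patch is hidden.
            masks.append([True] * patches_per_frame)
        else:
            # Context frame: hide a random subset of patches.
            hidden = set(rng.sample(range(patches_per_frame), num_hidden))
            masks.append([p in hidden for p in range(patches_per_frame)])
    return masks


# Example: a 5-frame clip with 16 patches per frame.
masks = frame_interpolation_mask(num_frames=5, patches_per_frame=16)
for t, frame_mask in enumerate(masks):
    print(t, "".join("#" if hidden else "." for hidden in frame_mask))
```

In this sketch, a reconstruction loss computed on the fully hidden frames can only be lowered by using information from adjacent frames, which is one plausible way to encourage the fine inter-frame temporal correspondences the abstract describes.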
