H-Net: A Multitask Architecture for Simultaneous 3D Force Estimation and Stereo Semantic Segmentation in Intracardiac Catheters

Pedram Fekri, Mehrdad Zadeh, Javad Dargahi

TL;DR

The paper tackles the need for sensor-free tactile and visual feedback in cardiac catheterization by proposing H-Net, a multitask encoder-decoder architecture that processes two X-ray views to simultaneously segment the catheter from both angles and estimate a 3D force vector $(F_x, F_y, F_z)$ in an end-to-end pipeline. The model uses two parameter-shared sub-networks with dual segmentation heads and a regression head, leveraging stereo features to improve $F_z$ estimation while maintaining low computational complexity. A synthetic X-ray data generator converts RGB images into simulated fluoroscopy, so the model is trained and evaluated on an RGB dataset and two synthetic X-ray datasets (XRay-1 and XRay-2) with varying backgrounds. Empirical results show state-of-the-art performance on both segmentation heads and on 3D force estimation with fewer than $5\times10^5$ parameters, which supports real-time, sensor-free catheter guidance and potential integration into autonomous or semi-autonomous interventional systems.

Abstract

The success rate of catheterization procedures is closely linked to the sensory data provided to the surgeon. Vision-based deep learning models can deliver both tactile and visual information in a sensor-free manner, while also being cost-effective to produce. Given the complexity of these models for devices with limited computational resources, research has focused on force estimation and catheter segmentation separately. However, there is a lack of a comprehensive architecture capable of simultaneously segmenting the catheter from two different angles and estimating the applied forces in 3D. To bridge this gap, this work proposes a novel, lightweight, multi-input, multi-output encoder-decoder-based architecture. It is designed to segment the catheter from two points of view and concurrently measure the applied forces in the x, y, and z directions. This network processes two simultaneous X-Ray images, intended to be fed by a biplane fluoroscopy system, showing a catheter's deflection from different angles. It uses two parallel sub-networks with shared parameters to output two segmentation maps corresponding to the inputs. Additionally, it leverages stereo vision to estimate the applied forces at the catheter's tip in 3D. The architecture features two input channels, two classification heads for segmentation, and a regression head for force estimation through a single end-to-end architecture. The output of all heads was assessed and compared with the literature, demonstrating state-of-the-art performance in both segmentation and force estimation. To the best of the authors' knowledge, this is the first time such a model has been proposed.
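The data flow described in the abstract (two views, one shared set of sub-network weights, two segmentation outputs, one 3D force output) can be illustrated with a toy sketch. This is not the authors' implementation: the class name `TinyHNet`, the per-pixel "encoder" and "decoder", the feature size, and the average pooling are all assumptions chosen so the example runs with the Python standard library alone. A real model would use convolutional layers in a deep learning framework.

```python
import math
import random


class TinyHNet:
    """Toy stand-in (hypothetical) for a two-view, shared-weight, three-head network."""

    def __init__(self, feat_dim=4, seed=0):
        rng = random.Random(seed)
        # One set of encoder/decoder weights, reused for BOTH views (parameter sharing).
        self.enc_w = [rng.uniform(-1, 1) for _ in range(feat_dim)]
        self.dec_w = [rng.uniform(-1, 1) for _ in range(feat_dim)]
        # The regression head sees features from both views: 2 * feat_dim in, 3 out.
        self.force_w = [[rng.uniform(-1, 1) for _ in range(2 * feat_dim)]
                        for _ in range(3)]

    def encode(self, image):
        # Per-pixel feature vector; a real encoder would be convolutional.
        return [[[math.tanh(w * px) for w in self.enc_w] for px in row]
                for row in image]

    def decode(self, feats):
        # Per-pixel sigmoid: probability that the pixel belongs to the catheter.
        def prob(px):
            z = sum(w * f for w, f in zip(self.dec_w, px))
            return 1.0 / (1.0 + math.exp(-z))
        return [[prob(px) for px in row] for row in feats]

    @staticmethod
    def pool(feats):
        # Global average pooling: one feature vector per view.
        n = sum(len(row) for row in feats)
        dim = len(feats[0][0])
        return [sum(px[k] for row in feats for px in row) / n for k in range(dim)]

    def forward(self, view_a, view_b):
        feats_a, feats_b = self.encode(view_a), self.encode(view_b)
        mask_a, mask_b = self.decode(feats_a), self.decode(feats_b)
        # Concatenate pooled features from both views before regressing the force.
        stereo = self.pool(feats_a) + self.pool(feats_b)
        force = [sum(w * s for w, s in zip(row, stereo)) for row in self.force_w]
        return mask_a, mask_b, force  # two segmentation maps and (F_x, F_y, F_z)


# Two tiny 2x2 "images" standing in for the two fluoroscopy views.
view_a = [[0.0, 1.0], [0.5, 0.2]]
view_b = [[0.3, 0.0], [0.9, 0.1]]
model = TinyHNet()
mask_a, mask_b, force = model.forward(view_a, view_b)
```

Because both views pass through the same weights, swapping the inputs swaps the two segmentation maps, while the force head combines information from both views in a single regression.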

Paper Structure

This paper contains 8 sections, 7 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: The process of generating synthetic X-Ray images from the RGB images and their corresponding thresholded images.
  • Figure 2: The diagram depicts H-Net's architecture, which includes two sub-networks, each with an encoder and a decoder. All layers of the encoders and decoders are shared between the two sub-networks. The architecture also has two segmentation heads and a central force estimation head. Upon receiving two input images, the network outputs two segmentation maps and the estimated forces in 3D.
  • Figure 3: The diagram shows two samples from the RGB, synthetic XRay-1, and XRay-2 datasets, together with the outputs of H-Net for each sample. For each dataset, H-Net had already been trained on the corresponding training set.
  • Figure 4: The diagram shows the predicted forces in $x$, $y$, and $z$ for 80 input samples from XRay-2's test set.
  • Figure 5: Subplots (a) to (c) show the histogram of errors for the output of H-Net's force estimation head on each dataset. Subplots (d) to (f) show the training and validation loss and accuracy of H-Net.