DaFoEs: Mixing Datasets towards the generalization of vision-state deep-learning Force Estimation in Minimally Invasive Robotic Surgery

Mikel De Iturrate Reyzabal, Mingcong Chen, Wei Huang, Sebastien Ourselin, Hongbin Liu

TL;DR

The paper tackles sensorless force estimation in minimally invasive robotic surgery by fusing vision and robot-state data across multiple datasets. It introduces DaFoEs, a variable-environment vision-haptic dataset, and a generalization pipeline that mixes data from DaFoEs with a dVRK dataset to improve cross-domain robustness. A ViT- or ResNet-based visual encoder paired with either a non-recurrent MLP or a recurrent LSTM decoder handles single-frame and sequence inputs, using a 5-frame temporal window and a 54-element generalized state vector; kinematic-aware augmentations further enhance generalization. Results show that dataset mixing reduces domain bias, with recurrent decoders delivering stronger cross-domain performance (mean relative errors around 5% vs. 12% for non-recurrent) and transformers gaining with more data, highlighting the potential for more variable data collection across hardware. The work suggests that mixing experimental setups is a viable path toward general sensorless force estimation in MIRS and motivates future data collection and modeling strategies to achieve robust real-time haptic feedback.
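
As a rough illustration of the input shapes mentioned above, the sketch below fuses a per-frame visual feature with a 54-element state vector and groups frames into 5-frame sequences. The visual feature size (4), the consecutive-frame sampling with stride 1, and the helper names `fuse` and `sliding_windows` are assumptions made for this example; the paper's own temporal sampling scheme may differ.

```python
from typing import List, Sequence

STATE_DIM = 54  # generalized state vector size stated in the summary
WINDOW = 5      # temporal window length stated in the summary


def fuse(visual_feature: Sequence[float], state: Sequence[float]) -> List[float]:
    """Concatenate one frame's visual feature with its robot-state vector."""
    if len(state) != STATE_DIM:
        raise ValueError(f"expected {STATE_DIM} state values, got {len(state)}")
    return list(visual_feature) + list(state)


def sliding_windows(samples: Sequence[List[float]], window: int = WINDOW,
                    stride: int = 1) -> List[List[List[float]]]:
    """Group fused per-frame vectors into fixed-length sequences.

    Consecutive sampling with stride 1 is an assumption of this sketch.
    """
    last_start = len(samples) - window
    return [list(samples[i:i + window]) for i in range(0, last_start + 1, stride)]


# Hypothetical clip: 8 frames with 4-dimensional visual features (placeholder size).
clip = [fuse([float(t)] * 4, [0.0] * STATE_DIM) for t in range(8)]
single_frame_input = clip[0]             # input for a non-recurrent (MLP) decoder
sequence_inputs = sliding_windows(clip)  # inputs for a recurrent (LSTM) decoder
print(len(single_frame_input), len(sequence_inputs), len(sequence_inputs[0]))
```

With these placeholder sizes, each fused vector has 4 + 54 = 58 elements, and an 8-frame clip yields 4 overlapping 5-frame sequences.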

Abstract

Precisely determining the contact force during safe interaction in Minimally Invasive Robotic Surgery (MIRS) is still an open research challenge. Inspired by post-operative qualitative analysis of surgical videos, cross-modality data-driven deep neural network models have been one of the newest approaches to predicting sensorless force trends. However, these methods require large and variable datasets, which are not currently available. In this paper, we present a new vision-haptic dataset (DaFoEs) with variable soft environments for training deep neural models. To reduce the bias from a single dataset, we present a pipeline that generalizes different vision and state data inputs for mixed-dataset training, using a previously validated dataset with a different setup. Finally, we present a variable encoder-decoder architecture to predict the forces exerted by the laparoscopic tool from a single input or a sequence of inputs. For input sequences, we use a recurrent decoder, denoted with the prefix R, and a new temporal sampling to represent the acceleration of the tool. During training, we demonstrate that single-dataset training tends to overfit to the training data domain and has difficulty translating the results to new domains. Dataset mixing, however, translates well, with a mean relative estimated force error of 5% and 12% for the recurrent and non-recurrent models, respectively. Our method also marginally increases the effectiveness of transformers for force estimation, by up to a maximum of ~15%, as the volume of available data is increased by 150%. In conclusion, we demonstrate that mixing experimental setups for vision-state force estimation in MIRS is a possible approach towards a general solution of the problem.
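
The abstract reports a mean relative estimated force error without giving its formula here. A minimal sketch of one common definition follows, assuming the error is the absolute difference between predicted and measured force divided by the magnitude of the measured force, averaged over samples. The function name `mean_relative_error` and the `eps` guard are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def mean_relative_error(predicted: Sequence[float], measured: Sequence[float],
                        eps: float = 1e-6) -> float:
    """Average of |predicted - measured| / |measured| over all samples.

    `eps` keeps the ratio finite when the measured force is close to zero;
    this guard is an assumption of the sketch.
    """
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, m in zip(predicted, measured):
        total += abs(p - m) / max(abs(m), eps)
    return total / len(measured)


# Made-up force values in newtons, for illustration only.
measured = [1.0, 2.0, 4.0]
predicted = [1.1, 1.8, 4.0]
error = mean_relative_error(predicted, measured)
print(f"{error:.1%}")
```

For these made-up values the per-sample relative errors are 10%, 10% and 0%, so the mean is about 6.7%.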

Paper Structure

This paper contains 15 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Complete experimental setup used to collect the DaFoEs (Dataset for Force Estimation) dataset. The setup is divided into 3 main color-coded components: the teleoperated robot arm (blue), the master controller (green) and the forceps controller (red). The left side of the image shows the different options for the soft-tissue environment.
  • Figure 2: Example of the horizontal mirroring transformation in our kinematic-aware augmentation pipeline. The image plane shows the visual transformation. The lower part shows the steps used to update the kinematic vector of the robot. K stands for kinematics and IK stands for inverse kinematics.
  • Figure 3: Graphical representation of our training pipeline for the vision-state models. The top right part shows the different visual encoders used in this research (ResNet50 and Vision Transformer). After concatenation with the state vector, there are two types of decoder: non-recurrent (MLP) or recurrent (LSTM).
  • Figure 4: Metrics comparing the effectiveness of our dataset mixing approach. The bars represent the dataset of origin of the testing clip. a) and b) show isolated training on a single dataset (dVRK and DaFoEs, respectively) and the translation experiment to the opposite dataset. c) shows the force difference for mixed-dataset training.
  • Figure 5: Results for the feature isolation experiment as bar plots. The X axis shows the different models presented in the paper.
  • ...and 2 more figures
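
The Figure 2 caption describes mirroring the image while updating the robot's kinematic vector through kinematics (K) and inverse kinematics (IK). The sketch below illustrates that idea on a toy planar two-link arm. The arm model, the link lengths, the mirror axis (the vertical image axis, so x changes sign) and the function names are all assumptions for illustration; the paper's pipeline works on the surgical robot's actual kinematics.

```python
import math
from typing import List, Tuple

L1, L2 = 1.0, 0.8  # hypothetical link lengths of the toy arm


def forward_kinematics(q1: float, q2: float) -> Tuple[float, float]:
    """K: joint angles to tool-tip position for a planar two-link arm."""
    x = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
    y = L1 * math.sin(q1) + L2 * math.sin(q1 + q2)
    return x, y


def inverse_kinematics(x: float, y: float, elbow: float) -> Tuple[float, float]:
    """IK: closed-form solution; `elbow` (+1 or -1) selects the sign of q2."""
    c2 = (x * x + y * y - L1 * L1 - L2 * L2) / (2.0 * L1 * L2)
    c2 = max(-1.0, min(1.0, c2))  # guard against rounding just outside [-1, 1]
    q2 = elbow * math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(L2 * math.sin(q2), L1 + L2 * math.cos(q2))
    return q1, q2


def mirror_sample(image: List[List[int]], joints: Tuple[float, float],
                  force: Tuple[float, float]):
    """Flip the image horizontally and keep joints and force consistent with it."""
    flipped = [row[::-1] for row in image]         # visual transformation
    x, y = forward_kinematics(*joints)             # K: joints -> Cartesian
    elbow = -1.0 if joints[1] >= 0 else 1.0        # a reflection flips the elbow side
    new_joints = inverse_kinematics(-x, y, elbow)  # mirror x, then IK
    new_force = (-force[0], force[1])              # lateral force changes sign
    return flipped, new_joints, new_force


image = [[1, 2, 3], [4, 5, 6]]
joints, force = (0.4, 0.9), (0.5, -1.0)
flipped, new_joints, new_force = mirror_sample(image, joints, force)
print(flipped, new_force)
```

Running forward kinematics on `new_joints` returns the original tool-tip position with its x coordinate negated, which is the consistency the augmentation needs between the flipped image and the state vector.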