Multi-objective Cross-task Learning via Goal-conditioned GPT-based Decision Transformers for Surgical Robot Task Automation

Jiawei Fu, Yonghao Long, Kai Chen, Wang Wei, Qi Dou

TL;DR

This work tackles long-horizon, goal-conditioned surgical robot automation by introducing a goal-conditioned decision transformer that uses time-to-goal as a future indicator to strengthen temporal reasoning. Training proceeds in two stages: cross-task, multi-objective pretraining (action prediction, forward dynamics, time-to-goal, and sequence reconstruction), then downstream task learning, with hindsight data augmentation. The approach shows promising performance and task versatility compared to existing methods across 10 SurRoL tasks, and the learned trajectories can be deployed on the dVRK platform, which supports its practicality in real surgical robot settings. Overall, the method advances task-agnostic reasoning for surgical robotics by using GPT-based sequential modeling to learn and transfer goal-reaching dynamics without task-specific reward shaping.

Abstract

Surgical robot task automation is a promising research direction for improving surgical efficiency and quality. Learning-based methods have been recognized as an attractive paradigm and are increasingly investigated. However, existing approaches struggle with long-horizon goal-conditioned tasks because of their intricate compositional structure, which requires decision-making over a sequence of sub-steps and an understanding of the inherent dynamics of goal-reaching tasks. In this paper, we propose a new learning-based framework that leverages the strong reasoning capability of the GPT-based architecture to automate surgical robotic tasks. The key to our approach is a goal-conditioned decision transformer that builds sequential representations with goal-aware future indicators in order to enhance temporal reasoning. Moreover, to exploit a general understanding of the dynamics inherent in manipulation, and thus make the model's reasoning ability task-agnostic, we also design a cross-task pretraining paradigm that uses multiple training objectives with data from diverse tasks. We have conducted extensive experiments on 10 tasks using the surgical robot learning simulator SurRoL \cite{long2023human}. The results show that our approach achieves promising performance and task versatility compared to existing methods. The learned trajectories can be deployed on the da Vinci Research Kit (dVRK), which validates their practicality in real surgical robot settings. Our project website is at: https://med-air.github.io/SurRoL.

Paper Structure (22 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Architecture of the proposed model. For each timestep $t$, the sequence consists of four items: ${\hat{T}}_{t}$ (time-to-goal), $\mathbf{g}_{t}$ (goal), $\mathbf{o}_{t}$ (observation), and $\mathbf{a}_{t}$ (action). Each item is combined with a timestep embedding and processed by the GPT-style transformer backbone, whose outputs are passed to dedicated prediction heads. During pretraining and downstream learning, sequences act as both input and target, guided by the training objectives. During evaluation, a cached history sequence is updated with model predictions and environment data to predict the action $\mathbf{a}_{t}$.
  • Figure 2: Illustration of the multiple training objectives, which strengthen the model's contextual reasoning over sequences and its understanding of task-agnostic patterns.
  • Figure 3: Ablation on the amount of available data for the 10 tasks. The mean and standard deviation of the success rate over all tasks are shown for each amount of available data.
  • Figure 4: Trajectory deployment of 6 tasks on the dVRK platform. The timestep within the episode is labeled.
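
The per-timestep token layout described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the helper names `time_to_goal` and `interleave_tokens`, the toy dimensions, and the definition of time-to-goal as the number of steps remaining until the last step of a successful episode are all assumptions made for illustration. The embedding layers, timestep embedding, and GPT backbone are omitted.

```python
from typing import List, Sequence, Tuple

# A token is (kind, timestep, vector). The names here are hypothetical.
Token = Tuple[str, int, List[float]]


def time_to_goal(episode_len: int) -> List[float]:
    """Steps remaining until the final step of the episode.

    Assumed definition: the last step of a successful episode has value 0.
    """
    return [float(episode_len - 1 - t) for t in range(episode_len)]


def interleave_tokens(
    ttg: Sequence[float],
    goals: Sequence[Sequence[float]],
    observations: Sequence[Sequence[float]],
    actions: Sequence[Sequence[float]],
) -> List[Token]:
    """Flatten an episode into the order (T_t, g_t, o_t, a_t) for each t."""
    tokens: List[Token] = []
    for t in range(len(ttg)):
        tokens.append(("ttg", t, [ttg[t]]))
        tokens.append(("goal", t, list(goals[t])))
        tokens.append(("obs", t, list(observations[t])))
        tokens.append(("action", t, list(actions[t])))
    return tokens


# Toy episode with 3 timesteps; all dimensions are arbitrary placeholders.
episode_len = 3
goals = [[0.1, 0.2, 0.3]] * episode_len
observations = [[0.0] * 7 for _ in range(episode_len)]
actions = [[0.0] * 5 for _ in range(episode_len)]

ttg = time_to_goal(episode_len)
tokens = interleave_tokens(ttg, goals, observations, actions)
```

In a full model, each token would be projected by a modality-specific embedding and summed with the embedding of its timestep before entering the transformer; the `timestep` field kept in each tuple is what that lookup would use.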