Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks

Ji Woong Kim; Tony Z. Zhao; Samuel Schmidgall; Anton Deguet; Marin Kobilarov; Chelsea Finn; Axel Krieger

TL;DR

The paper tackles learning surgical manipulation on the da Vinci robot despite its inherently noisy forward kinematics by adopting a relative action formulation. It evaluates three action representations (camera-centric, tool-centric, and hybrid-relative) and finds that the relative representations, especially hybrid-relative actions whose translations are grounded in a fixed endoscope-tip frame, yield the most robust imitation-learning performance when trained on approximate kinematics data, with actions expressed as $SE(3)$-based pose differences. Using action chunking transformers (ACT) and, to a lesser extent, diffusion policies, the authors demonstrate high success across tissue lift, needle handling, and knot-tying, with wrist camera input significantly enhancing performance in depth-sensitive phases. The study suggests that large repositories of approximate kinematics data can be leveraged for scalable autonomous surgery without kinematics corrections, and it highlights the practical value of wrist cameras for generalization and safety in real-world settings.

Abstract

We explore whether surgical manipulation tasks can be learned on the da Vinci robot via imitation learning. However, the da Vinci system presents unique challenges that hinder straightforward implementation of imitation learning. Notably, its forward kinematics is inconsistent due to imprecise joint measurements, and naively training a policy using such approximate kinematics data often leads to task failure. To overcome this limitation, we introduce a relative action formulation that enables successful policy training and deployment using the robot's approximate kinematics data. A promising outcome of this approach is that the large repository of clinical data, which contains approximate kinematics, may be directly utilized for robot learning without further corrections. We demonstrate our findings through successful execution of three fundamental surgical tasks: tissue manipulation, needle handling, and knot-tying.

Paper Structure (15 sections, 7 figures, 4 tables)

Introduction
Related Work
Manipulation and Imitation Learning
Autonomous Surgery
Technical Approach
Implementation Details
Experiments
Experiment Setup
Evaluating the Consistency of Relative Motion vs. Absolute Forward Kinematics
Policy Performance Using Various Action Representations
Evaluating the Importance of Wrist Camera
Evaluating Generalization
Limitations and Conclusion
Implementation Details
Generalization to Novel Settings

Figures (7)

Figure 1: (Left): The da Vinci Research Kit (dVRK) system is equipped with a surgical endoscope and wrist cameras. (Right): Three fundamental surgical tasks are learned: tissue lift (i.e., tissue retraction), needle pickup and handover, and knot-tying, which are among the most common surgical tasks.
Figure 2: We propose a policy design which only takes images as input and outputs relative pose trajectories for both arms. Modeling policy actions as relative motion is a key ingredient that makes robot learning work on the dVRK.
Figure 3: The dVRK system consists of an endoscopic camera manipulator (ECM) and two patient side manipulators (PSM1, PSM2). Unfortunately, the dVRK arms are notorious for providing inconsistent forward kinematics. This is because the setup joints (blue) use only potentiometers for joint measurements, which can be unreliable. The active joints (pink) use both potentiometers and motor encoders, improving precision.
Figure 4: We consider three options for modeling policy actions. (Left): The camera-centric approach models actions as absolute end-effector poses w.r.t. the endoscope tip frame. (Middle): The tool-centric approach models actions as delta positions and delta rotations defined w.r.t. the current end-effector frame. (Right): The hybrid-relative approach models actions as delta positions defined w.r.t. the endoscope tip frame and delta rotations defined w.r.t. the current end-effector frame.
Figure 5: The repeatability of each action representation is tested by replaying a recorded reference trajectory under various robot configurations. (Left): All action representations reconstruct the reference trajectory perfectly, since the robot joints have not moved since the reference trajectory was collected. (Middle, Right): When the robot is shifted to the left or to the right, the camera-centric action representation fails to track the reference trajectory, while the relative action representations track it closely. This is primarily due to the set-up joints being moved, which causes significant joint measurement errors. This experiment shows that, in the presence of inconsistent joint measurements, relative motion can be more consistent.
...and 2 more figures
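
The three action representations described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the example poses, and the error model are assumptions made for illustration. In particular, the kinematic error is modeled here as a constant, purely translational offset applied to the measured end-effector pose in the endoscope-tip frame, which is a deliberate simplification of the joint measurement errors discussed in Figures 3 and 5.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def rot_z(angle_rad):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def camera_centric_action(pose_now, pose_next):
    """Absolute target pose expressed in the endoscope-tip frame."""
    return pose_next.copy()


def tool_centric_action(pose_now, pose_next):
    """Delta pose expressed in the current end-effector frame."""
    return np.linalg.inv(pose_now) @ pose_next


def hybrid_relative_action(pose_now, pose_next):
    """Delta translation in the endoscope-tip frame, delta rotation in the
    current end-effector frame."""
    delta_translation = pose_next[:3, 3] - pose_now[:3, 3]
    delta_rotation = pose_now[:3, :3].T @ pose_next[:3, :3]
    return delta_translation, delta_rotation


# Hypothetical "true" end-effector poses at two consecutive time steps,
# expressed in the endoscope-tip frame (units: metres, radians).
true_now = make_pose(rot_z(0.1), np.array([0.02, 0.00, 0.10]))
true_next = make_pose(rot_z(0.3), np.array([0.03, 0.01, 0.09]))

# Assumed error model: a constant translational offset on the measured poses.
offset = make_pose(np.eye(3), np.array([0.05, -0.02, 0.01]))
meas_now = offset @ true_now
meas_next = offset @ true_next

# The absolute (camera-centric) action inherits the offset.
camera_centric_changed = not np.allclose(
    camera_centric_action(meas_now, meas_next),
    camera_centric_action(true_now, true_next),
)

# The tool-centric delta cancels the offset.
tool_centric_unchanged = np.allclose(
    tool_centric_action(meas_now, meas_next),
    tool_centric_action(true_now, true_next),
)

# Both parts of the hybrid-relative action cancel the offset as well.
dt_meas, dr_meas = hybrid_relative_action(meas_now, meas_next)
dt_true, dr_true = hybrid_relative_action(true_now, true_next)
hybrid_unchanged = np.allclose(dt_meas, dt_true) and np.allclose(dr_meas, dr_true)

print(camera_centric_changed, tool_centric_unchanged, hybrid_unchanged)
```

Under this simplified error model, the absolute camera-centric target shifts with the offset, whereas both relative representations are unaffected by it, which is consistent with the qualitative behavior reported in Figure 5. Real dVRK kinematic errors are not limited to constant translational offsets, so the sketch only conveys the intuition and does not reproduce the paper's experiment.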
