Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks
Ji Woong Kim, Tony Z. Zhao, Samuel Schmidgall, Anton Deguet, Marin Kobilarov, Chelsea Finn, Axel Krieger
TL;DR
The paper tackles learning surgical manipulation on the da Vinci robot, whose forward kinematics are inherently noisy, by adopting a relative action framework. It evaluates three action representations (camera-centric, tool-centric, and hybrid-relative) and finds that relative representations yield the most robust imitation-learning performance when policies are trained on approximate kinematics data using $SE(3)$-based pose differences. Hybrid-relative actions, which ground translations in a fixed endoscope-tip frame, perform best. Using action chunking transformers (ACT) and, to a lesser extent, diffusion policies, the authors demonstrate high success rates on tissue lift, needle handling, and knot-tying, with wrist camera input significantly improving performance in depth-sensitive phases. The study suggests that large repositories of approximate kinematics data can be leveraged for scalable autonomous surgery without kinematics corrections, and it highlights the practical value of wrist cameras for generalization and safety in real-world settings.
Abstract
We explore whether surgical manipulation tasks can be learned on the da Vinci robot via imitation learning. However, the da Vinci system presents unique challenges that hinder straightforward implementation of imitation learning. Notably, its forward kinematics is inconsistent due to imprecise joint measurements, and naively training a policy using such approximate kinematics data often leads to task failure. To overcome this limitation, we introduce a relative action formulation that enables successful policy training and deployment using the robot's approximate kinematics data. A promising outcome of this approach is that the large repository of clinical data, which contains approximate kinematics, may be directly utilized for robot learning without further corrections. We demonstrate our findings through successful execution of three fundamental surgical tasks: tissue manipulation, needle handling, and knot-tying.
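
To make the three action representations named in the TL;DR concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that tool poses are 4x4 homogeneous transforms expressed in the endoscope-tip frame, that a camera-centric action is the absolute target pose, that a tool-centric action is the $SE(3)$ difference expressed in the current tool frame, and that a hybrid-relative action pairs a translation delta in the endoscope-tip frame with a rotation delta in the tool frame. All function names are hypothetical, and the constant translation bias used to mimic kinematic error is a deliberately simplified assumption rather than the error model studied in the paper.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def camera_centric_action(pose_next):
    """Assumed definition: the absolute target pose in the endoscope-tip frame."""
    return pose_next.copy()


def tool_centric_action(pose_now, pose_next):
    """Assumed definition: relative motion expressed in the current tool frame."""
    return np.linalg.inv(pose_now) @ pose_next


def hybrid_relative_action(pose_now, pose_next):
    """Assumed definition: translation delta in the fixed endoscope-tip frame,
    rotation delta in the current tool frame."""
    delta_t = pose_next[:3, 3] - pose_now[:3, 3]
    delta_rot = pose_now[:3, :3].T @ pose_next[:3, :3]
    return delta_t, delta_rot


def apply_hybrid_relative(pose_now, delta_t, delta_rot):
    """Recover the target pose from the current pose and a hybrid-relative action."""
    return make_pose(pose_now[:3, :3] @ delta_rot, pose_now[:3, 3] + delta_t)


# Two consecutive tool poses (illustrative values, in metres and radians).
pose_now = make_pose(rot_z(0.10), [0.020, 0.010, 0.080])
pose_next = make_pose(rot_z(0.15), [0.022, 0.009, 0.078])

# Simplified error model (an assumption): a constant translation bias
# corrupts every pose reported by the forward kinematics.
bias = np.array([0.005, -0.003, 0.004])
biased_now = make_pose(pose_now[:3, :3], pose_now[:3, 3] + bias)
biased_next = make_pose(pose_next[:3, :3], pose_next[:3, 3] + bias)

delta_t, delta_rot = hybrid_relative_action(pose_now, pose_next)
biased_delta_t, biased_delta_rot = hybrid_relative_action(biased_now, biased_next)

# The absolute (camera-centric) label shifts with the bias; the relative one does not.
print("camera-centric label changed:",
      not np.allclose(camera_centric_action(pose_next),
                      camera_centric_action(biased_next)))
print("hybrid-relative label changed:",
      not (np.allclose(delta_t, biased_delta_t)
           and np.allclose(delta_rot, biased_delta_rot)))
```

Under this toy error model the bias cancels in the pose difference, which gives some intuition for why relative actions tolerate approximate kinematics. Real da Vinci kinematic errors are not a constant offset, so the sketch illustrates the idea only and does not reproduce the paper's results.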
