DARIL: When Imitation Learning outperforms Reinforcement Learning in Surgical Action Planning
Maxence Boels, Harry Robertshaw, Thomas C Booth, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin
TL;DR
The paper tackles surgical action planning: predicting future instrument-verb-target triplets for real-time guidance. It introduces DARIL, a dual-task autoregressive imitation learning framework, and compares it against three RL variants (world model-based RL, direct video RL, and inverse RL enhancement) on the CholecT50 dataset, assessing both recognition and multi-horizon planning. DARIL reaches 34.6% action triplet recognition mAP and 33.6% next-frame prediction mAP, degrading smoothly to 29.2% at a 10-second horizon, while every RL variant scores lower (world model RL falls to 3.1% mAP at 10s; direct video RL reaches only 15.9%). The authors attribute the RL underperformance to an evaluation that rewards expert-like behavior and to the limited benefit of exploration in safety-critical domains. They discuss when IL may be preferable and suggest hybrid IL-RL strategies and simulation-based exploration for surgical AI.
Abstract
Surgical action planning requires predicting future instrument-verb-target triplets for real-time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could discover superior strategies through self-exploration. We present the first comprehensive comparison of IL versus RL for surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% action triplet recognition mAP and 33.6% next-frame prediction mAP, with planning performance degrading smoothly to 29.2% at a 10-second horizon. We evaluated three RL variants: world model-based RL, direct video RL, and inverse RL enhancement. Surprisingly, all RL approaches underperformed DARIL: world model RL dropped to 3.1% mAP at 10s, while direct video RL achieved only 15.9%. Our analysis reveals that distribution matching on expert-annotated test sets systematically favors IL over potentially valid RL policies that differ from training demonstrations. This challenges assumptions about RL superiority in sequential decision making and provides crucial insights for surgical AI development.
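To make the reported metric concrete, the following is a minimal sketch of how mean average precision (mAP) over multi-label triplet predictions could be computed per planning horizon. It is a generic formulation, not the authors' evaluation code: the function names (`average_precision`, `mean_average_precision`, `map_by_horizon`), the array layout (frames by triplet classes), and the choice to skip classes with no positive frames are all assumptions made for illustration, and the paper's exact protocol (for example, any per-video averaging) is not specified in this section.

```python
import numpy as np


def average_precision(y_true, y_score):
    """AP for one class: mean precision at the rank of each positive frame."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    if n_pos == 0:
        # Assumption: AP is undefined for a class that never occurs.
        return float("nan")
    # Rank frames by descending score (ties keep their original order).
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].sum() / n_pos)


def mean_average_precision(Y_true, Y_score):
    """mAP over classes; inputs have shape (n_frames, n_classes)."""
    Y_true = np.asarray(Y_true)
    Y_score = np.asarray(Y_score)
    aps = np.array(
        [average_precision(Y_true[:, c], Y_score[:, c]) for c in range(Y_true.shape[1])]
    )
    valid = ~np.isnan(aps)
    # Assumption: classes without positives are excluded from the mean.
    return float(aps[valid].mean()) if valid.any() else float("nan")


def map_by_horizon(true_by_horizon, score_by_horizon):
    """Hypothetical helper: mAP per horizon, given dicts keyed by horizon in seconds."""
    return {
        h: mean_average_precision(true_by_horizon[h], score_by_horizon[h])
        for h in true_by_horizon
    }
```

Under this sketch, a planner that degrades smoothly yields mAP values that fall gradually as the horizon key grows, whereas a sharp drop at long horizons shows up as a much lower value for the largest key.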
