Surgical Vision World Model
Saurabh Koju, Saurav Bastola, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Rudra P. K. Poudel, Binod Bhattarai
TL;DR
The paper tackles the lack of realistic, action-controllable surgical simulations by introducing SurgWM, a three-component world model that learns from unlabeled surgical videos. It combines a Video Tokenizer, a Surgical Latent Action Model, and a Surgical Dynamics Model based on spatio-temporal transformers to generate future frames conditioned on latent actions. Trained on the SurgToolLoc-2022 dataset in two stages, SurgWM achieves high-quality, action-conditioned frame generation and captures tool-tissue interactions, respiratory motion, and other real-world dynamics. The work demonstrates improved generation quality and controllability when conditioning on latent actions and more prompt frames, highlighting potential for training autonomous surgical agents and enhancing surgical training without extensive action annotations.
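To make the data flow between the three components concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a Genie-style design in which the latent action is a discrete index into a small codebook. Random linear maps stand in for the spatio-temporal transformers. All sizes, weights, and function names (`tokenize`, `infer_latent_action`, `predict_next_tokens`) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; none of these values come from the paper.
NUM_TOKENS = 16   # visual tokens per frame
VOCAB = 32        # size of the visual token vocabulary
NUM_ACTIONS = 8   # size of the latent action codebook (assumed discrete)
DIM = 12          # embedding width

token_emb = rng.normal(size=(VOCAB, DIM))              # visual token embeddings
action_codebook = rng.normal(size=(NUM_ACTIONS, DIM))  # latent action codes
W_lam = rng.normal(size=(2 * DIM, DIM))  # stand-in for the latent action encoder
W_dyn = rng.normal(size=(DIM, VOCAB))    # stand-in for the dynamics model


def tokenize(frame):
    """Stand-in video tokenizer: map each patch vector to its nearest token id."""
    dists = ((frame[:, None, :] - token_emb[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)


def infer_latent_action(tokens_t, tokens_next):
    """Pool two consecutive frames, project, and snap to the nearest action code."""
    pooled = np.concatenate(
        [token_emb[tokens_t].mean(0), token_emb[tokens_next].mean(0)]
    )
    z = pooled @ W_lam
    return int(((action_codebook - z) ** 2).sum(-1).argmin())


def predict_next_tokens(tokens_t, action_id):
    """Stand-in dynamics model: condition each token on the action, pick top logit."""
    h = token_emb[tokens_t] + action_codebook[action_id]
    logits = h @ W_dyn
    return logits.argmax(axis=1)


# Two random "frames" of patch vectors play the role of consecutive video frames.
frame_t = rng.normal(size=(NUM_TOKENS, DIM))
frame_next = rng.normal(size=(NUM_TOKENS, DIM))

tokens_t = tokenize(frame_t)
tokens_next = tokenize(frame_next)
action = infer_latent_action(tokens_t, tokens_next)  # inferred without labels
predicted = predict_next_tokens(tokens_t, action)    # action-conditioned prediction
```

During training in such a setup the latent action is inferred from pairs of consecutive frames, so no action labels are needed. At generation time a user could instead pick `action_id` directly to steer the predicted frame.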
Abstract
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such work in the surgical domain has been limited to simplified computer simulations and lacks realism. Furthermore, the existing literature on world models has predominantly dealt with action-labeled data, limiting its applicability to real-world surgical data, where obtaining action annotations is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data, and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Code and implementation details are available at https://github.com/bhattarailab/Surgical-Vision-World-Model
