RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model

Shunlei Li; Jin Wang; Rui Dai; Wanyu Ma; Wing Yin Ng; Yingbai Hu; Zheng Li

TL;DR

The paper addresses autonomous robotic scrub nurse handover in dynamic operating-room environments by proposing RoboNurse-VLA, a Vision-Language-Action model that fuses SAM 2 for visual segmentation with Llama 2 for action planning under voice commands. It introduces a compact VLA architecture trained with LoRA on an ex-vivo instrument dataset and demonstrates strong zero-shot generalization and robust fine-tuning performance, outperforming baselines like Octo, RT-2-X, OpenVLA, and Diffusion Policy. Key contributions include the first surgical instrument handover application of a VLA, integration of SAM 2 for difficult grasping, and high success rates on unseen and complex tools. The results show RoboNurse-VLA achieving high accuracy and fast inference, indicating significant potential for improving efficiency and safety in surgical settings, with future work focusing on clinical deployment and safety enhancements.
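The LoRA training mentioned above can be illustrated with a minimal, dependency-free sketch. It is not the authors' training code: the matrix sizes, `rank`, and `alpha` are illustrative assumptions, and the helper names are hypothetical. It only shows the standard LoRA idea of freezing a weight matrix W and learning a low-rank update, so that the effective weight is W + (alpha / rank) · B · A.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the weight a LoRA layer effectively applies."""
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in) -> (d_out x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_fraction(d_out, d_in, rank):
    """Fraction of parameters trained by LoRA relative to the full matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)


# Illustrative sizes only (assumptions, not values from the paper).
d_out, d_in, rank, alpha = 4, 6, 2, 4.0
rng = random.Random(0)

w = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
b = [[0.0] * rank for _ in range(d_out)]  # trainable, zero-initialised

# With B initialised to zero, the adapted layer starts identical to the frozen one.
w_eff = lora_effective_weight(w, a, b, alpha, rank)
print(w_eff == w)

# Share of trainable parameters for a hypothetical 4096 x 4096 layer at rank 32.
print(trainable_fraction(4096, 4096, 32))
```

Under these assumed sizes, a rank-32 adapter on a 4096 × 4096 matrix trains about 1.6% as many parameters as full fine-tuning of that matrix, which is the kind of saving that makes adapting a 7B-parameter backbone practical.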

Abstract

In modern healthcare, the demand for autonomous robotic assistants has grown significantly, particularly in the operating room, where surgical tasks require precision and reliability. Robotic scrub nurses have emerged as a promising solution to improve efficiency and reduce human error during surgery. However, challenges remain in accurately grasping and handing over surgical instruments, especially when dealing with complex or difficult objects in dynamic environments. In this work, we introduce a novel robotic scrub nurse system, RoboNurse-VLA, built on a Vision-Language-Action (VLA) model by integrating the Segment Anything Model 2 (SAM 2) and the Llama 2 language model. The proposed RoboNurse-VLA system enables highly precise grasping and handover of surgical instruments in real time based on voice commands from the surgeon. Leveraging state-of-the-art vision and language models, the system can address key challenges in object detection, pose optimization, and the handling of complex and difficult-to-grasp instruments. Through extensive evaluations, RoboNurse-VLA demonstrates superior performance compared to existing models, achieving high success rates in surgical instrument handovers, even with unseen tools and challenging items. This work presents a significant step forward in autonomous surgical assistance, showcasing the potential of integrating VLA models for real-world medical applications. More details can be found at https://robonurse-vla.github.io.

Paper Structure (12 sections, 8 figures, 1 table)

Introduction
Related works
RoboNurse-VLA Model
Architecture
Data collection
Training
Experiments and results
Accuracy of Gladia ASR, Detector, and SAM 2
Zero-shot performance
Fine-tuned models for scrub nurse handover
Performance on unseen tools and difficult-to-grasp items
Conclusion

Figures (8)

Figure 2: RoboNurse-VLA model architecture. Given an image observation and a speech instruction, the model predicts robot control actions. The architecture consists of three key components: (1) a SAM 2-based vision module, (2) a projector that maps visual features to the language embedding space, and (3) the pretrained Llama 2 7B-parameter LLM in OpenVLA.
Figure 3: The workflow of the vision module.
Figure 4: Categories of ex-vivo surgical instruments and the hand.
Figure 5: The setup of RoboNurse-VLA includes a microphone, a table with six surgical instruments, an Intel RealSense Depth Camera D415, and a UR5 robotic arm.
Figure 6: Zero-shot evaluation tasks and results.
...and 3 more figures
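
The projector described in the Figure 2 caption can be sketched as follows. This is a rough illustration rather than the paper's implementation: it assumes a single linear layer (the caption does not specify the projector's form), uses tiny made-up dimensions, and substitutes random vectors for real SAM 2 features and Llama 2 token embeddings. All function and variable names are hypothetical.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def project_patches(patch_features, weight, bias):
    """Map each visual patch feature into the language embedding space."""
    return [linear(patch, weight, bias) for patch in patch_features]


def build_llm_input(visual_tokens, text_tokens):
    """Prepend projected visual tokens to the instruction's token embeddings."""
    return visual_tokens + text_tokens


# Toy dimensions (assumptions); real vision and LLM widths are far larger.
VISION_DIM, LLM_DIM = 8, 16
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3
rng = random.Random(0)

# Projector parameters; a single linear layer is assumed here.
weight = [[rng.gauss(0.0, 0.02) for _ in range(VISION_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

# Random stand-ins for vision-module features and instruction embeddings.
patches = [[rng.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text = [[rng.random() for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

visual_tokens = project_patches(patches, weight, bias)
sequence = build_llm_input(visual_tokens, text)
print(len(sequence), len(sequence[0]))
```

The point of the sketch is the shape contract: whatever the vision module emits, the projector must produce vectors of the same width as the language model's token embeddings so that both can be placed in one input sequence.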
