Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, Hongliang Ren
TL;DR
The paper tackles grounded visual question answering in robotic surgery by extending a large vision-language model (LVLM) with domain-specific mechanisms. It introduces Surgical-LVLM, which adds Visual Perception LoRA (VP-LoRA) blocks to a base LVLM (Qwen-VL) and a Token-Interaction (TIT) module that aligns language outputs with visual grounding. The model is trained with a two-stage strategy: vision-language instruction tuning, then multimodal grounding alignment. The authors validate the approach on the EndoVis-17-VQLA and EndoVis-18-VQLA benchmarks and a new EndoVis Conversations dataset, reporting state-of-the-art grounding performance and superior VQA reasoning in complex surgical scenes. Ablations show that VP-LoRA and instruction tuning are pivotal to both language quality and grounding accuracy. The work advances automated surgical mentorship by delivering a context-aware, reasoning-capable LVLM tailored to surgical environments, while acknowledging the safety, reliability, and deployment challenges that remain for real-world clinical use.
Abstract
Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and related region grounding have shown great promise for robotic and medical applications, addressing the critical need for automated methods in personalized surgical mentorship. However, existing models primarily provide simple structured answers and struggle with complex scenarios due to their limited capability in recognizing long-range dependencies and aligning multimodal information. In this paper, we introduce Surgical-LVLM, a novel personalized large vision-language model (LVLM) tailored for complex surgical scenarios. By combining a pre-trained LVLM with specialized Visual Perception LoRA (VP-LoRA) blocks, our model excels at complex vision-language tasks in surgical contexts. For the visual grounding task, we propose the Token-Interaction (TIT) module, which strengthens the interaction between the grounding module and the language responses of the LVLM after projecting them into the latent space. We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset, on which it sets new performance standards. Our work contributes to advancing the field of automated surgical mentorship by providing a context-aware solution.
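As background for the LoRA blocks mentioned above, the following is a minimal sketch of a generic low-rank adapter on a single linear layer, written in plain Python. It illustrates the standard LoRA idea only (a frozen weight plus a scaled low-rank update) and is not the VP-LoRA block itself, whose placement and design are not specified in this summary. The class name `LoRALinear` and the parameters `rank`, `alpha`, and `seed` are hypothetical, chosen for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Generic LoRA adapter (illustrative, not the paper's VP-LoRA).

    Computes y = W x + (alpha / rank) * B (A x), where W is frozen and
    only the small matrices A and B would be trained.
    """

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                  # frozen base weight, shape (out, in)
        self.scale = alpha / rank             # standard LoRA scaling factor
        # A starts with small random values, shape (rank, in).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        # B starts at zero, shape (out, rank), so the adapter is a no-op at init.
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)               # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]


layer = LoRALinear([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
x = [1.0, 2.0, 3.0]
y_init = layer.forward(x)  # equals the frozen layer's output while B is zero
```

Because `B` is initialized to zero, the adapted layer reproduces the pre-trained layer exactly before any fine-tuning; training then only has to learn the `rank * (in + out)` adapter parameters rather than the full weight matrix.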
