ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model
Satoshi Kondo
TL;DR
The paper addresses surgical phase recognition by focusing on representation learning for spatial features. It introduces ReSW-VL, a two-stage approach: first, the CLIP image encoder is fine-tuned via prompt learning (with the text encoder kept fixed) to produce robust frame-level representations; second, a temporal model is trained with the image encoder frozen. Across three laparoscopic datasets, ReSW-VL variants consistently outperform conventional CNN-based methods, and the differences in performance are attributed to how sequential the phases are in each dataset. The work demonstrates the potential of vision-language models for surgical workflow analysis and suggests extensions to larger CLIP backbones and Transformer-based temporal models for further gains.
Abstract
Surgical phase recognition from video automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition have focused primarily on Transformer-based methods, although two-stage methods have also shown high performance: these extract spatial features from individual frames using a CNN and then extract video features from the resulting time series of spatial features using time series modeling. However, there remains a paucity of research on training methods for the CNNs employed for feature extraction, that is, on representation learning in surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). Our proposed method fine-tunes the image encoder of a CLIP (Contrastive Language-Image Pre-training) vision-language model using prompt learning for surgical phase recognition. Experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method in comparison to conventional methods.
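As a rough illustration of how a CLIP-style model can score a frame against surgical phases, the following minimal sketch compares an image feature with one text feature per phase using cosine similarity, then applies a temperature-scaled softmax and a cross-entropy loss. It is a sketch under stated assumptions, not the paper's implementation: the phase names, the three-dimensional toy features, the temperature value, and all function names are hypothetical. In an actual setup the features would come from the CLIP image and text encoders, and the loss would be backpropagated into the image encoder while the text encoder stays fixed.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def phase_probabilities(image_feature, text_features, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities (CLIP-style scoring).

    The temperature value here is an illustrative assumption.
    """
    image = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        text = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(cosine / temperature)
    # Subtract the maximum logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, target_index):
    """Negative log-likelihood of the ground-truth phase."""
    return -math.log(probabilities[target_index])


# Hypothetical phase names and toy features standing in for encoder outputs.
phases = ["preparation", "dissection", "closure"]
text_features = [
    [1.0, 0.0, 0.0],  # stands in for the text feature of the "preparation" prompt
    [0.0, 1.0, 0.0],  # stands in for the text feature of the "dissection" prompt
    [0.0, 0.0, 1.0],  # stands in for the text feature of the "closure" prompt
]
image_feature = [0.2, 0.9, 0.1]  # stands in for the image encoder output of one frame

probs = phase_probabilities(image_feature, text_features)
predicted = phases[max(range(len(probs)), key=probs.__getitem__)]
loss = cross_entropy(probs, target_index=1)  # assume the true phase is "dissection"
```

Here the toy image feature is closest to the second text feature, so the predicted phase is "dissection" and the loss for that label is small. The frame-level features learned this way would then be passed, with the image encoder frozen, to a separate temporal model.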
