ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model

Satoshi Kondo

TL;DR

The paper addresses surgical phase recognition by focusing on representation learning for spatial features. It introduces ReSW-VL, a two-stage approach: the first stage fine-tunes a CLIP image encoder via prompt learning (while keeping the text encoder fixed) to produce robust frame-level representations, and the second stage performs temporal modeling with the image encoder frozen. Across three laparoscopic datasets, ReSW-VL variants consistently outperform conventional CNN-based methods, with performance differences across datasets attributed to how strictly sequential their phases are. The work demonstrates the potential of vision-language models for surgical workflow analysis and suggests extensions to larger CLIP backbones and Transformer-based temporal models for further gains.

Abstract

Surgical phase recognition from video automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition have focused primarily on Transformer-based methods, although methods that first extract spatial features from individual frames with a CNN and then derive video features from the resulting feature sequence with a time-series model have also shown high performance. However, there remains a paucity of research on how to train the CNNs used for feature extraction, that is, on representation learning for surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). The proposed method fine-tunes the image encoder of a CLIP (Contrastive Language-Image Pre-training) vision-language model using prompt learning for surgical phase recognition. Experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method compared with conventional methods.
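
The abstract does not spell out how a CLIP-style model turns a frame into a phase prediction, so the following is a minimal sketch of the generic CLIP scoring step, not the paper's implementation. It assumes one text embedding per phase prompt, scores a frame embedding against each by cosine similarity, and applies a temperature-scaled softmax. The phase names, the toy 3-dimensional embeddings, the helper names, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def phase_probabilities(image_emb, prompt_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities (CLIP-style scoring).

    image_emb:   embedding of one video frame (assumed output of an image encoder)
    prompt_embs: one embedding per phase prompt (assumed output of a text encoder)
    temperature: assumed value for illustration only
    """
    img = l2_normalize(image_emb)
    logits = []
    for prompt in prompt_embs:
        txt = l2_normalize(prompt)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Loss that fine-tuning would minimize for a frame labeled with phase `target`."""
    return -math.log(probs[target])

# Hypothetical phase names and toy embeddings, used only to exercise the functions.
PHASES = ["preparation", "dissection", "closure"]
prompt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_emb = [0.2, 0.9, 0.1]

probs = phase_probabilities(frame_emb, prompt_embs)
predicted = PHASES[probs.index(max(probs))]
loss = cross_entropy(probs, target=1)
```

In a setup like the one described above, the prompt embeddings would stay fixed (frozen text encoder) while gradients of such a loss update the image encoder; this sketch only shows the forward scoring, with no training loop.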

Paper Structure

This paper contains 5 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the first stage of the proposed method.
  • Figure 2: Overview of the second stage of the proposed method.
  • Figure 3: Qualitative results of the predictions for video 50 in the Cholec80 dataset.