Friends Across Time: Multi-Scale Action Segmentation Transformer for Surgical Phase Recognition

Bokai Zhang; Jiayuan Meng; Bin Cheng; Dean Biskup; Svetlana Petculescu; Angela Chapman

TL;DR

The paper tackles automatic surgical phase recognition with a pair of Transformer-based models that capture temporal information at multiple scales. The Multi-Scale Action Segmentation Transformer (MS-AST) targets offline recognition, while the Multi-Scale Action Segmentation Causal Transformer (MS-ASCT) targets online recognition, where each prediction may depend only on past and current frames. Both use multi-scale temporal self-attention and cross-attention. On Cholec80, the reported accuracies are 95.26% (online) and 96.15% (offline), which the authors present as new state-of-the-art results. The method also carries over to non-medical action segmentation, with state-of-the-art results reported on 50Salads and GTEA using standard I3D features, which suggests practical potential for real-time operating room and video analysis workflows. The design pairs a spatial feature extractor (ResNet50 or EfficientNetV2-M) with temporal attention over windows of several sizes, so that both short and long surgical actions are captured.

Abstract

Automatic surgical phase recognition is a core technology for modern operating rooms and online surgical video assessment platforms. Current state-of-the-art methods use both spatial and temporal information to tackle the surgical phase recognition task. Building on this idea, we propose the Multi-Scale Action Segmentation Transformer (MS-AST) for offline surgical phase recognition and the Multi-Scale Action Segmentation Causal Transformer (MS-ASCT) for online surgical phase recognition. We use ResNet50 or EfficientNetV2-M for spatial feature extraction. MS-AST and MS-ASCT model temporal information at different scales with multi-scale temporal self-attention and multi-scale temporal cross-attention, which improves the capture of temporal relationships between frames and segments. Our method achieves 95.26% and 96.15% accuracy on the Cholec80 dataset for online and offline surgical phase recognition, respectively, setting new state-of-the-art results. It also achieves state-of-the-art results on non-medical datasets in the video action segmentation domain.
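To make the offline/online distinction concrete, the following is a minimal sketch of sliding-window attention masks at several temporal scales, in non-causal and causal variants. It is an illustration under stated assumptions, not the authors' implementation: the function names, the window sizes, the use of raw features as queries, keys, and values (no learned projections), and the averaging of per-scale outputs are all hypothetical choices made for brevity.

```python
import numpy as np

def sliding_window_mask(num_frames: int, window: int, causal: bool) -> np.ndarray:
    """Boolean mask: mask[i, j] is True if frame i may attend to frame j."""
    idx = np.arange(num_frames)
    offset = idx[:, None] - idx[None, :]  # i - j; positive means j lies in the past
    if causal:
        # Online setting: the current frame plus the previous (window - 1) frames.
        return (offset >= 0) & (offset < window)
    # Offline setting: a window centred on frame i, looking in both directions.
    return np.abs(offset) <= window // 2

def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention restricted by a boolean mask.

    Assumption: x serves directly as queries, keys, and values.
    """
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)      # block disallowed positions
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def multi_scale_attention(x: np.ndarray, windows, causal: bool) -> np.ndarray:
    """Run masked attention once per window size and average the results.

    Assumption: simple averaging stands in for whatever fusion the paper uses.
    """
    outputs = [
        masked_self_attention(x, sliding_window_mask(len(x), w, causal))
        for w in windows
    ]
    return np.mean(outputs, axis=0)

# Example: 16 frames of 8-dimensional features, three hypothetical window sizes.
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))
online = multi_scale_attention(features, windows=(3, 5, 9), causal=True)
offline = multi_scale_attention(features, windows=(3, 5, 9), causal=False)
```

With the causal mask, the output at frame i is unchanged if later frames are altered, which is the property an online recognizer needs. The non-causal mask lets each frame use context on both sides, which is only available offline.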

Paper Structure (16 sections, 2 equations, 5 figures, 8 tables)

Introduction
Method
Feature Extraction Network
Action Segmentation Network
Transformer for Action Segmentation
Multi-Scale Action Segmentation Transformer
Multi-Scale Action Segmentation Causal Transformer
Dataset
Experiments
Evaluation metrics
Implementation details
Results
Online surgical phase recognition
Offline surgical phase recognition
Action segmentation on non-medical datasets
...and 1 more sections

Figures (5)

Figure 1: The overview of our method
Figure 2: Multi-Scale Action Segmentation Transformer
Figure 3: Sliding Window Attention: (1) Non-causal (2) Causal
Figure 4: Color-coded ribbon illustration for online surgical phase recognition: (a) EffNetV2 Causal ASFormer (b) EffNetV2 MS-ASCT (c) Ground Truth
Figure 5: Normalized Confusion matrix of EffNetV2 MS-ASCT on the Cholec80 dataset for online surgical phase recognition
