Multi-view Video-Pose Pretraining for Operating Room Surgical Activity Recognition

Idris Hamoud, Vinkle Srivastav, Muhammad Abdullah Jamal, Didier Mutter, Omid Mohareri, Nicolas Padoy

TL;DR

This paper tackles surgical activity recognition (SAR) from uncalibrated multi-view operating room videos by introducing PreViPS, a calibration-free, multi-view, multi-modal pretraining framework. It uses Pose as Compositional Tokens (PCT) to tokenize continuous 2D poses and aligns pose embeddings with vision embeddings through CLIP-like objectives, geometric regularizers, and masked pose modeling. The method achieves data-efficient transfer and strong cross-view and unimodal performance on the 4D-OR and OR-AR datasets, while also benefiting single-view setups. Together, these contributions enable accurate SAR without calibrated multi-view cameras or expensive 3D scene graph processing, advancing practical surgical workflow understanding in real ORs.
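To make the idea of tokenizing continuous poses concrete, here is a minimal sketch of nearest-neighbor codebook lookup (plain vector quantization). It is a simplified stand-in, not the PCT tokenizer itself: the function name `quantize_pose_features`, the toy codebook, and the 2-dimensional features are all assumptions chosen for illustration, and a real codebook would be learned rather than written by hand.

```python
import numpy as np


def quantize_pose_features(features, codebook):
    """Map each continuous feature vector to the id of its nearest code.

    features: (T, D) array of continuous pose features (hypothetical input).
    codebook: (K, D) array of code vectors (hand-written here; learned in practice).
    Returns a (T,) integer array of token ids.
    """
    # Pairwise squared Euclidean distances, shape (T, K), via broadcasting.
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # The token id is the index of the closest code vector.
    return np.argmin(dists, axis=1)


# Toy example: three codes and three features, each lying near a different code.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
features = np.array([[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]])
token_ids = quantize_pose_features(features, codebook)
print(token_ids)
```

Once poses are discrete ids, they can be embedded through a lookup table like word tokens, which is what makes them convenient inputs for a transformer-style pose encoder.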

Abstract

Understanding the workflow of surgical procedures in complex operating rooms requires a deep understanding of the interactions between clinicians and their environment. Surgical activity recognition (SAR) is a key computer vision task that detects activities or phases from multi-view camera recordings. Existing SAR models often fail to account for fine-grained clinician movements and multi-view knowledge, or they require calibrated multi-view camera setups and advanced point-cloud processing to obtain better results. In this work, we propose a novel calibration-free multi-view multi-modal pretraining framework called Multiview Pretraining for Video-Pose Surgical Activity Recognition (PreViPS), which aligns 2D pose and vision embeddings across camera views. Our model follows a CLIP-style dual-encoder architecture: one encoder processes visual features, while the other encodes human pose embeddings. To handle continuous 2D human pose coordinates, we introduce a tokenized representation that converts them into discrete pose embeddings, thereby enabling efficient integration within the dual-encoder framework. To bridge the gap between these two modalities, we propose several pretraining objectives that apply cross- and in-modality geometric constraints within the embedding space and incorporate a masked pose token prediction strategy to enhance representation learning. Extensive experiments and ablation studies demonstrate improvements over strong baselines, while data-efficiency experiments on two distinct operating room datasets further highlight the effectiveness of our approach. We highlight the benefits of our approach for surgical activity recognition in both multi-view and single-view settings, showcasing its practical applicability in complex surgical environments. Code will be made available at: https://github.com/CAMMA-public/PreViPS.
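The CLIP-style alignment between the two encoders can be illustrated with a symmetric contrastive (InfoNCE) loss over a batch of paired vision and pose embeddings. This is a minimal NumPy sketch of the generic objective, not the paper's exact loss: the function names, the temperature of 0.07, and the toy embeddings are assumptions, and the geometric regularizers and masked pose token prediction mentioned above are not included.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(vision_emb, pose_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair.

    vision_emb, pose_emb: (B, D) arrays from two hypothetical encoders.
    temperature: assumed value, not taken from the paper.
    """
    v = l2_normalize(vision_emb)
    p = l2_normalize(pose_emb)
    logits = v @ p.T / temperature  # (B, B) scaled cosine similarities
    idx = np.arange(len(v))
    # Vision-to-pose direction: each vision row should pick its own pose column.
    loss_v2p = -log_softmax(logits)[idx, idx].mean()
    # Pose-to-vision direction: the same objective on the transposed matrix.
    loss_p2v = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_v2p + loss_p2v)


# Toy check: perfectly matched pairs versus deliberately shifted pairs.
vision = np.eye(4)
aligned_loss = symmetric_contrastive_loss(vision, vision)
mismatched_loss = symmetric_contrastive_loss(vision, np.roll(vision, 1, axis=0))
print(aligned_loss, mismatched_loss)
```

In the toy check, the matched batch yields a loss close to zero while the shifted batch yields a much larger one, which is the behavior that pushes paired vision and pose embeddings together during pretraining.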

Paper Structure

This paper contains 35 sections, 8 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overview of our framework: (a) Given a video clip, we first extract all human poses using ViTPose-Base. We tokenize the poses using PCT and use a two-stream approach with MaskFeat on the vision features. (b) We use different pretraining objectives on the global representations of each modality and viewpoint. (c) We present our finetuning protocol, utilizing global representations from various modalities and viewpoints. Additionally, we demonstrate the versatility of our approach, enabling us to train and test our methods using different viewpoints.
  • Figure 2: Activity label distribution: An overview of the activity durations in both 4D-OR (top) and OR-AR (bottom) datasets.
  • Figure 3: GradCAM visualizations: In the visualization of videos, brighter colors indicate higher attention. Notably, we observe that greater attention is assigned to moving body parts. The top row shows activation maps from our pretrained model with alignment objectives, while the bottom row displays results from the model trained without video-pose alignment.
  • Figure 4: Box plots of accuracy distributions from the 4D-OR clip classification experiment across the available camera viewpoints. The ablation was run using only the pose modality as input.