Multimodal Transformers for Real-Time Surgical Activity Prediction

Keshara Weerasinghe; Seyed Hamid Reza Roodabeh; Kay Hutchinson; Homa Alemzadeh

TL;DR

This paper presents a multimodal transformer architecture for real-time recognition and prediction of surgical gestures and trajectories from short segments of kinematic and video data. An ablation study evaluates how fusing different input modalities and their representations affects gesture recognition and prediction performance.

Abstract

Real-time recognition and prediction of surgical activities are fundamental to advancing safety and autonomy in robot-assisted surgery. This paper presents a multimodal transformer architecture for real-time recognition and prediction of surgical gestures and trajectories based on short segments of kinematic and video data. We conduct an ablation study to evaluate the impact of fusing different input modalities and their representations on gesture recognition and prediction performance. We perform an end-to-end assessment of the proposed architecture using the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) dataset. Our model outperforms the state-of-the-art (SOTA) with 89.5% accuracy for gesture prediction through effective fusion of kinematic features with spatial and contextual video features. It achieves real-time performance, taking 1.1-1.3 ms to process a 1-second input window, by relying on a computationally efficient model.
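
The "1-second input window" mentioned in the abstract can be illustrated with a small streaming buffer. The following is only a minimal sketch, not the authors' pipeline: it assumes a 30 Hz stream (so one second is 30 frames) and a stride of one frame, and the function name `sliding_windows` and its parameters are hypothetical. The sampling rate, stride, and feature layout actually used in the paper are not stated in this section.

```python
from collections import deque
from typing import Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def sliding_windows(
    frames: Iterable[Frame],
    window_size: int = 30,  # assumption: 30 frames = 1 second at 30 Hz
    stride: int = 1,        # assumption: emit a new window every frame
) -> Iterator[List[Frame]]:
    """Yield the most recent `window_size` frames each time the window advances."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")

    # The deque drops the oldest frame automatically once it is full.
    buffer = deque(maxlen=window_size)
    for index, frame in enumerate(frames):
        buffer.append(frame)
        seen = index + 1
        # Emit the first window when the buffer fills, then every `stride` frames.
        if seen >= window_size and (seen - window_size) % stride == 0:
            yield list(buffer)


if __name__ == "__main__":
    # Integers stand in for per-frame kinematic or video feature vectors.
    for window in sliding_windows(range(35)):
        print(window[0], window[-1])
```

In a real-time setting, each emitted window would be passed to the model as soon as the newest frame arrives, which is why the per-window processing time reported in the abstract matters.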

Paper Structure (18 sections, 3 figures, 4 tables)

INTRODUCTION
PRELIMINARIES
Surgical Gestures and Context
JIGSAWS Dataset
Gesture Recognition
Gesture and Trajectory Prediction
METHODS
Feature Extraction and Transformation
Gesture Recognition
Gesture and Trajectory Prediction
EXPERIMENTAL EVALUATION
Experimental Setup
Metrics
Results
Gesture Recognition
...and 3 more sections

Figures (3)

Figure 1: Overall Architecture for End-to-End Real-Time Surgical Activity Recognition and Prediction
Figure 2: A sample timeline of a Suturing trial, illustrating the gestures executed throughout the trial. Top Row: Actual gestures, Middle Row: Predicted gestures, Bottom Row: Error intervals (often occurring when transitioning to the next gestures).
Figure 3: Trajectory Prediction results for X-axis and Z-axis position of the right instrument for a subject in the Suturing task
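
The three rows described in the Figure 2 caption (actual gestures, predicted gestures, error intervals) can be derived from two per-frame label sequences. The sketch below is an illustration under assumptions: it treats accuracy as frame-wise agreement, which may differ from the definition in the paper's Metrics section, and the function names and the sample labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def framewise_accuracy(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of frames whose predicted gesture label matches the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not actual:
        raise ValueError("sequences must not be empty")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def error_intervals(
    actual: Sequence[str], predicted: Sequence[str]
) -> List[Tuple[int, int]]:
    """Return half-open frame ranges [start, end) where the prediction is wrong."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, (a, p) in enumerate(zip(actual, predicted)):
        if a != p and start is None:
            start = i                      # an error run begins
        elif a == p and start is not None:
            intervals.append((start, i))   # the error run ended at frame i
            start = None
    if start is not None:
        intervals.append((start, len(actual)))  # run lasted to the final frame
    return intervals


if __name__ == "__main__":
    # Illustrative labels only; not taken from the paper's results.
    actual = ["G1", "G1", "G1", "G2", "G2", "G2"]
    predicted = ["G1", "G1", "G2", "G2", "G2", "G1"]
    print(framewise_accuracy(actual, predicted))
    print(error_intervals(actual, predicted))
```

In this toy input, one error falls on the frame just before the true transition from G1 to G2, which mirrors the caption's remark that errors often occur when transitioning to the next gesture.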
