MS-TCRNet: Multi-Stage Temporal Convolutional Recurrent Networks for Action Segmentation Using Sensor-Augmented Kinematics

Adam Goldbraikh; Omer Shubi; Or Rubin; Carla M Pugh; Shlomi Laufer

TL;DR

This work tackles action segmentation from kinematic data in surgical contexts, introducing two MS-TCRNet variants (L-MS-TCRNet and G-MS-TCRNet) that pair a TCN-based prediction generator with bidirectional RNN (BiRNN) refinement stages. A key innovation is intra-stage regularization, achieved by adding short prediction heads inside the dual dilated residual layers (DDRLs), coupled with downsampling in the refinement stages to reduce over-segmentation. The authors also propose two geometry-aware data augmentations, World Frame Rotation and Hand Inversion, which exploit the geometric structure of kinematic data to improve robustness across datasets. Evaluations on VTS, BRS, and JIGSAWS demonstrate state-of-the-art performance for kinematic data, with notable gains on data from left-handed surgeons and across diverse data collection setups. The work advances practical surgical workflow analysis by delivering robust, geometry-aware action segmentation methods that generalize beyond robot-assisted minimally invasive surgery (RAMIS) to other domains that use kinematic traces.

Abstract

Action segmentation is a challenging task in high-level process analysis, typically performed on video or kinematic data obtained from various sensors. This work presents two contributions related to action segmentation on kinematic data. Firstly, we introduce two versions of Multi-Stage Temporal Convolutional Recurrent Networks (MS-TCRNet), specifically designed for kinematic data. The architectures consist of a prediction generator with intra-stage regularization and Bidirectional LSTM or GRU-based refinement stages. Secondly, we propose two new data augmentation techniques, World Frame Rotation and Hand Inversion, which utilize the strong geometric structure of kinematic data to improve algorithm performance and robustness. We evaluate our models on three datasets of surgical suturing tasks: the Variable Tissue Simulation (VTS) Dataset and the newly introduced Bowel Repair Simulation (BRS) Dataset, both of which are open surgery simulation datasets collected by us, as well as the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS), a well-known benchmark in robotic surgery. Our methods achieved state-of-the-art performance.
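The two augmentations named in the abstract both act on the geometry of the sensor positions, so a small NumPy sketch may help make them concrete. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`random_rotation_matrix`, `rotate_world_frame`, `reflect_across_plane`) are hypothetical, the way the rotation is sampled is an assumption, the reflection plane is passed in directly rather than fitted by an SVM, and orientation channels are not handled.

```python
import numpy as np


def random_rotation_matrix(rng):
    """Draw a random proper rotation (orthogonal 3x3 matrix, det = +1).

    Assumption: the sampling scheme is illustrative only; the source does
    not state how rotations are drawn.
    """
    # QR decomposition of a Gaussian matrix yields an orthogonal factor q.
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    # Normalize column signs so the result does not depend on QR conventions.
    q = q * np.sign(np.diag(r))
    # Flip one axis if needed so the matrix is a rotation, not a reflection.
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q


def rotate_world_frame(positions, rotation):
    """Apply one rotation to every position sample; positions has shape (T, 3)."""
    return positions @ rotation.T


def reflect_across_plane(positions, normal, offset):
    """Mirror positions of shape (T, 3) across the plane normal . x + offset = 0.

    Assumption: normal and offset are given; in the source the plane comes
    from an SVM output, which is not reproduced here.
    """
    norm = np.linalg.norm(normal)
    unit_normal = normal / norm
    # Signed distance of each sample to the plane.
    signed_dist = positions @ unit_normal + offset / norm
    return positions - 2.0 * signed_dist[:, None] * unit_normal


rng = np.random.default_rng(0)
trajectory = rng.standard_normal((100, 3))  # synthetic stand-in for one sensor
rotated = rotate_world_frame(trajectory, random_rotation_matrix(rng))
mirrored = reflect_across_plane(trajectory, np.array([1.0, 0.0, 0.0]), 0.0)
```

Both transforms are rigid up to a mirror, so frame-to-frame displacement magnitudes (and therefore linear speeds) are unchanged, which is one reason such augmentations can leave the action labels intact.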

Paper Structure (41 sections, 7 equations, 7 figures, 10 tables)

Introduction
Background
Our Contribution
Related Work
Action Segmentation
Kinematic Data and Analysis
Temporal Data Augmentations
Datasets
Variable Tissue Simulation (VTS) Dataset
Bowel Repair Simulation (BRS) Dataset
VTS and BRS Data Acquisition and Preprocessing
JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS)
Action Segmentation
Multi-Stage Temporal Convolutional Recurrent Networks
Prediction Generation Module
...and 26 more sections

Figures (7)

Figure 1: (A) Participants' hands with sensors, from the VTS dataset; in BRS the sensors are positioned in the same locations. (B)-(D) The three datasets used to evaluate our algorithms and augmentations: (B) the Variable Tissue Simulation (VTS) Dataset, (C) the Bowel Repair Simulation (BRS) Dataset, and (D) the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS).
Figure 2: General structure of the multi-stage network
Figure 3: Prediction generator with intra-unit regularization with a close-up view of a dual dilated residual layer (DDRL).
Figure 4: Sample points from two sensors, one on each hand, illustrating the Hand Inversion augmentation. The star- and cross-shaped points emphasize the transformations. The reflection plane given by the SVM output is represented by the purple line. A shows the original points in 3D, and B in the XY plane projection. The flipped points are shown in the XY plane in C and in 3D in D. Points in dark blue (orange) and light blue (red) represent points from the right (left) hand before and after augmentation respectively. Augmentation of orientation is not displayed.
Figure 5: The complete data flow in the training process. A-E represent the input at different preprocessing stages, while F is the output, with dimensions equal to the number of classes by the number of time samples. (A) shows the raw input containing the position and orientation of each sensor, (B) the position and orientation after Hand Inversion augmentation, (C) the position and orientation after World Frame Rotation augmentation, (D) the calculated linear and angular velocities, and (E) the normalized velocities.
...and 2 more figures
