Multi-Modal Gesture Recognition from Video and Surgical Tool Pose Information via Motion Invariants

Jumanh Atoum, Garrison L. H. Johnston, Nabil Simaan, Jie Ying Wu

TL;DR

This work investigates real-time gesture recognition in robotic surgery by incorporating geometry-aware kinematic features. It introduces motion invariants—curvature and torsion—computed from a striction-curve derived from screws of finite motion, and fuses these invariants with vision and pose data using a Relational Graph Network (MRG-Net). On the JIGSAWS suturing dataset, adding curvature and torsion to position data yields state-of-the-art frame-wise accuracy of 90.3% and an Edit Score of 89.0%, outperforming pose-only and quaternion-based representations. The results demonstrate the value of geometry-aware modeling for surgical gesture understanding, with potential impacts on real-time skill assessment and automation.

Abstract

Recognizing surgical gestures in real time is a stepping stone towards automated activity recognition, skill assessment, intra-operative assistance, and eventually surgical automation. Current robotic surgical systems provide rich multi-modal data such as video and kinematics. While some recent multi-modal neural networks learn the relationships between vision and kinematics data, current approaches treat kinematics information as independent signals, with no underlying relation between tool-tip poses. However, instrument poses are geometrically related, and the underlying geometry can aid neural networks in learning gesture representations. Therefore, we propose combining motion-invariant measures (curvature and torsion) with vision and kinematics data using a relational graph network to capture the underlying relations between the data streams. We show that gesture recognition improves when combining invariant signals with tool position, achieving 90.3% frame-wise accuracy on the JIGSAWS suturing dataset. Our results show that motion-invariant signals coupled with position represent gesture motion better than traditional position and quaternion representations, and they highlight the need for geometry-aware modeling of kinematics for gesture recognition.
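To make the two invariants named in the abstract concrete, the following is a minimal sketch of how curvature and torsion can be estimated for a sampled 3D curve. It is not the authors' pipeline: it swaps in plain finite differences (`np.gradient`) on a uniformly sampled curve rather than a spline fit with arc-length parametrization, and the function name `curvature_torsion` and its parameters are hypothetical. It uses the standard parametrization-independent formulas $\kappa = \|r' \times r''\| / \|r'\|^3$ and $\tau = (r' \times r'') \cdot r''' / \|r' \times r''\|^2$.

```python
import numpy as np


def curvature_torsion(points, dt, eps=1e-12):
    """Estimate curvature and torsion of a uniformly sampled 3D curve.

    points: (N, 3) array of curve samples (illustrative input, e.g. points
            on a striction curve); dt: uniform parameter spacing.
    Derivatives are approximated by finite differences, so the first and
    last few samples are less accurate than the interior ones.
    """
    r1 = np.gradient(points, dt, axis=0)  # first derivative r'
    r2 = np.gradient(r1, dt, axis=0)      # second derivative r''
    r3 = np.gradient(r2, dt, axis=0)      # third derivative r'''

    cross = np.cross(r1, r2)                     # r' x r''
    cross_norm = np.linalg.norm(cross, axis=1)
    speed = np.linalg.norm(r1, axis=1)

    # kappa = |r' x r''| / |r'|^3 ; eps guards against division by zero.
    kappa = cross_norm / np.maximum(speed**3, eps)
    # tau = (r' x r'') . r''' / |r' x r''|^2
    tau = np.einsum("ij,ij->i", cross, r3) / np.maximum(cross_norm**2, eps)
    return kappa, tau


# Usage on a circular helix r(t) = (a cos t, a sin t, b t).
a, b, dt = 1.0, 0.5, 0.01
t = np.arange(0.0, 4.0 * np.pi, dt)
helix = np.stack([a * np.cos(t), a * np.sin(t), b * t], axis=1)
kappa, tau = curvature_torsion(helix, dt)
```

For a circular helix the analytic values are constant, $\kappa = a/(a^2+b^2)$ and $\tau = b/(a^2+b^2)$, which makes it a convenient sanity check. Both quantities are unchanged by rigid motions of the curve, which is the sense in which they are invariants.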

Paper Structure

This paper contains 10 sections, 9 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Screw of finite motion $\textit{\textdollaroldstyle}_t$ between two subsequent tool-tip frames. In this figure, $\{W\}$ denotes the world frame, $\{t\}$ denotes the tool-tip frame at time $t$, $\{t+1\}$ denotes the tool-tip frame at time $t+1$, $\mathbf{s}_0(t)$ denotes the closest point on the screw axis to the world frame origin, and $\hat{\mathbf{s}}(t)$ denotes a unit vector pointing along the direction of the screw axis.
  • Figure 2: Obtaining motion invariants from tool poses: (a) the screws of finite motion $\textit{\textdollaroldstyle}_i$ between subsequent poses and their common normals, (b) the Plücker line coordinate parameters $\hat{\mathbf{s}}_i$ and $\mathbf{s}_{0_i}$ and a spline curve approximating the curve of striction, (c) arc-length parametrization of the striction curve defining the curvature $\kappa(s)$ and torsion $\tau(s)$.
  • Figure 3: Network architecture used to extract the temporal feature from the kinematics signal. Variant kinematics refers to the tool-tip poses while invariant kinematics refers to the computed curvature and torsion. The outputs of the TCN and LSTM network are averaged.
  • Figure 4: Overall relational graph network showing three nodes for vision, and left and right kinematics. We show a single graph convolutional layer to illustrate the interactions between nodes, adapted from Long et al. (2021). The variant and invariant features (output from the network in Fig. 3) are concatenated and passed to the corresponding arm node. The arrows represent the relations between different nodes. After message passing, the hidden features are passed to a fully connected network to estimate the gesture per frame.
  • Figure 5: The color-coded ribbons compare the two ablations with the highest and lowest gesture recognition accuracy: the highest from tool-tip position, curvature, and torsion, and the lowest from tool-tip position, rotation, curvature, and torsion.
  • ...and 1 more figure
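The screw of finite motion described in the Figure 1 caption can be sketched numerically. The example below is an illustration under stated assumptions, not the authors' implementation: poses are assumed to be 4x4 homogeneous transforms mapping tool-tip coordinates to the world frame, the displacement is expressed in the world frame as `T_b @ inv(T_a)`, and the function name `finite_screw` and its return convention are hypothetical. It returns the unit axis direction $\hat{\mathbf{s}}$, the point $\mathbf{s}_0$ on the axis closest to the world origin, the rotation angle, and the translation along the axis.

```python
import numpy as np


def finite_screw(T_a, T_b, tol=1e-8):
    """Screw axis of the rigid displacement taking pose T_a to pose T_b.

    T_a, T_b: 4x4 homogeneous transforms (tool-tip frame -> world frame).
    Returns (s_hat, s0, theta, h): unit axis direction, axis point closest
    to the world origin, rotation angle, and translation along the axis.
    Near-zero rotations (axis at infinity) and rotations near pi are
    rejected because the axis extraction used here is ill-conditioned there.
    """
    D = T_b @ np.linalg.inv(T_a)  # displacement expressed in the world frame
    R, d = D[:3, :3], D[:3, 3]

    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < tol or np.pi - theta < tol:
        raise ValueError("rotation angle too close to 0 or pi for this sketch")

    # Axis direction from the skew-symmetric part of R.
    s_hat = np.array([R[2, 1] - R[1, 2],
                      R[0, 2] - R[2, 0],
                      R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))

    h = d @ s_hat              # translation along the axis
    d_perp = d - h * s_hat     # translation component normal to the axis
    # Axis point perpendicular to s_hat, i.e. the point closest to the origin:
    # s0 = 0.5 * (d_perp + cot(theta / 2) * (s_hat x d_perp))
    s0 = 0.5 * (d_perp + np.cross(s_hat, d_perp) / np.tan(theta / 2.0))
    return s_hat, s0, theta, h
```

The pair $(\hat{\mathbf{s}}, \mathbf{s}_0)$ carries the same line information as the Plücker parameters named in the Figure 2 caption. Applying the function to each pair of consecutive poses gives a sequence of axes, from which a striction curve could then be approximated.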