Multi-Modal Gesture Recognition from Video and Surgical Tool Pose Information via Motion Invariants

Jumanh Atoum; Garrison L. H. Johnston; Nabil Simaan; Jie Ying Wu

TL;DR

This work investigates real-time gesture recognition in robotic surgery by incorporating geometry-aware kinematic features. It introduces motion invariants (curvature and torsion) computed from a striction curve derived from screws of finite motion, and fuses these invariants with vision and pose data using a relational graph network (MRG-Net). On the JIGSAWS suturing dataset, adding curvature and torsion to position data yields a state-of-the-art frame-wise accuracy of 90.3% and an Edit Score of 89.0%, outperforming pose-only and quaternion-based representations. The results demonstrate the value of geometry-aware modeling for surgical gesture understanding, with potential impact on real-time skill assessment and automation.

Abstract

Recognizing surgical gestures in real time is a stepping stone towards automated activity recognition, skill assessment, intra-operative assistance, and eventually surgical automation. Current robotic surgical systems provide rich multi-modal data such as video and kinematics. While some recent works on multi-modal neural networks learn the relationships between vision and kinematics data, current approaches treat kinematics information as independent signals, with no underlying relation between tool-tip poses. However, instrument poses are geometrically related, and the underlying geometry can aid neural networks in learning gesture representations. Therefore, we propose combining motion invariant measures (curvature and torsion) with vision and kinematics data using a relational graph network to capture the underlying relations between different data streams. We show that gesture recognition improves when combining invariant signals with tool position, achieving 90.3% frame-wise accuracy on the JIGSAWS suturing dataset. Our results show that motion invariant signals coupled with position are better representations of gesture motion than traditional position and quaternion representations. Our results highlight the need for geometry-aware modeling of kinematics for gesture recognition.
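
To make the two invariants concrete, the sketch below estimates curvature and torsion of a sampled 3D curve using the standard differential-geometry formulas, κ = |r′ × r″| / |r′|³ and τ = (r′ × r″) · r‴ / |r′ × r″|². It is only an illustration: the function name `curvature_torsion`, the finite-difference scheme, and the uniform sampling step `dt` are assumptions made here, and the code does not reproduce the paper's striction-curve construction from screws of finite motion.

```python
import numpy as np


def curvature_torsion(points, dt=1.0):
    """Estimate curvature and torsion of a uniformly sampled 3D curve.

    points: array of shape (N, 3), one position sample per row.
    dt: sampling step of the curve parameter (assumed uniform).
    Returns two arrays of length N: (kappa, tau).
    """
    r = np.asarray(points, dtype=float)

    # Finite-difference derivatives along the sample axis.
    d1 = np.gradient(r, dt, axis=0)   # r'
    d2 = np.gradient(d1, dt, axis=0)  # r''
    d3 = np.gradient(d2, dt, axis=0)  # r'''

    cross = np.cross(d1, d2)                    # r' x r''
    cross_norm = np.linalg.norm(cross, axis=1)  # |r' x r''|
    speed = np.linalg.norm(d1, axis=1)          # |r'|

    eps = 1e-12  # guards against division by zero on straight or static segments
    kappa = cross_norm / np.maximum(speed ** 3, eps)
    tau = np.einsum("ij,ij->i", cross, d3) / np.maximum(cross_norm ** 2, eps)
    return kappa, tau


if __name__ == "__main__":
    # Sanity check on a helix r(t) = (cos t, sin t, t).
    t = np.arange(0.0, 4.0 * np.pi, 0.01)
    helix = np.stack([np.cos(t), np.sin(t), t], axis=1)
    kappa, tau = curvature_torsion(helix, dt=0.01)
    # Skip the boundary samples, where one-sided differences are less accurate.
    print(kappa[10:-10].mean(), tau[10:-10].mean())
```

For a helix r(t) = (a cos t, a sin t, b t), both quantities are constant, κ = a / (a² + b²) and τ = b / (a² + b²), so with a = b = 1 the interior estimates should sit close to 0.5. Both values are unchanged by rigid rotations and translations of the curve, which is the property that makes them attractive as gesture features. On real, noisy tool trajectories, the third derivative amplifies noise, so some smoothing before differentiation would likely be needed.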
