Instrument-tissue Interaction Detection Framework for Surgical Video Understanding

Wenjun Lin, Yan Hu, Huazhu Fu, Mingming Yang, Chin-Boon Chng, Ryo Kawasaki, Cheekong Chui, Jiang Liu

TL;DR

ITIDNet addresses instrument-tissue interaction detection in surgical videos by representing each interaction as a quintuple and adopting a two-stage framework that first detects instrument and tissue instances and then predicts the action for each instrument-tissue pair. It introduces three components: the Snippet Consecutive Feature (SCF) Layer, which fuses global context from the video snippet; the Spatial Corresponding Attention (SCA) Layer, which relates proposals across adjacent frames; and the Temporal Graph (TG) Layer, which reasons over intra-frame and inter-frame connections to predict actions. The authors evaluate the approach on two new datasets, PhacoQ and CholecQ, and report that it outperforms other state-of-the-art models on both. The work advances surgical scene understanding by combining instance localization with temporally aware interaction reasoning, a step toward more capable computer-assisted surgery systems.

Abstract

Instrument-tissue interaction detection, which helps in understanding surgical activities, is vital for constructing computer-assisted surgery systems but poses many challenges. First, most models represent instrument-tissue interaction in a coarse-grained way that focuses only on classification and cannot automatically detect instruments and tissues. Second, existing works do not fully consider the intra-frame and inter-frame relations between instruments and tissues. In this paper, we propose to represent an instrument-tissue interaction as an <instrument class, instrument bounding box, tissue class, tissue bounding box, action class> quintuple and present an Instrument-Tissue Interaction Detection Network (ITIDNet) to detect the quintuple for surgical video understanding. Specifically, we propose a Snippet Consecutive Feature (SCF) Layer to enhance features by modeling relationships among proposals in the current frame using global context information from the video snippet. We also propose a Spatial Corresponding Attention (SCA) Layer to incorporate features of proposals from adjacent frames through spatial encoding. To reason about relationships between instruments and tissues, we propose a Temporal Graph (TG) Layer with intra-frame connections, which exploit relationships between instruments and tissues in the same frame, and inter-frame connections, which model temporal information for the same instance. For evaluation, we build a cataract surgery video dataset (PhacoQ) and a cholecystectomy surgery video dataset (CholecQ). Experimental results demonstrate the promising performance of our model, which outperforms other state-of-the-art models on both datasets.
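
As a rough illustration of the quintuple representation described in the abstract, the following Python sketch stores one interaction as a small record. The field names, the `(x1, y1, x2, y2)` box convention, the `as_triplet` helper, and the example labels are all assumptions made for illustration; they are not taken from the paper or its datasets.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box convention: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class InteractionQuintuple:
    """One <instrument class, instrument box, tissue class, tissue box, action class> record."""
    instrument_class: str
    instrument_box: Box
    tissue_class: str
    tissue_box: Box
    action_class: str

    def as_triplet(self) -> Tuple[str, str, str]:
        # Dropping both boxes leaves a classification-only view,
        # i.e. the coarse-grained form without localization.
        return (self.instrument_class, self.action_class, self.tissue_class)


# Hypothetical labels and coordinates, for illustration only.
example = InteractionQuintuple(
    instrument_class="forceps",
    instrument_box=(120.0, 80.0, 260.0, 190.0),
    tissue_class="cornea",
    tissue_box=(60.0, 40.0, 400.0, 380.0),
    action_class="grasp",
)
triplet = example.as_triplet()
```

The point of the sketch is only that the quintuple adds two bounding boxes on top of the class labels, so each detected interaction is localized as well as classified.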

Paper Structure (30 sections, 3 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Examples of the existing instrument-tissue interaction recognition task and our detection task. (a) Instrument-tissue interaction recognition task (Nwoye et al., 2020), represented as triplets. (b) Our instrument-tissue interaction detection task, represented as quintuples.
  • Figure 2: Pipeline of the proposed ITIDNet model. Instruments and tissues are detected in the first stage, while actions for instrument-tissue pairs are predicted in the second stage.
  • Figure 3: Illustration of the proposed instance detection model in the first stage. The model extracts regional features using the Faster R-CNN backbone network, refines proposal features through the proposed SCF Layer and SCA Layer, and predicts the category and location of instruments and tissues. The proposed SCF Layer fuses global context information from the video snippet and exploits relationships among RoIs in the key frame $k$. The proposed SCA Layer utilizes relationships among RoIs in adjacent frames (from frame $k-r$ to $k$) guided by spatial encoding.
  • Figure 4: Illustration of the proposed Interaction Prediction model in the second stage. Instruments and tissues detected in stage 1 are sent to the feature extraction module to extract regional visual features first. These features are sent to the Temporal Graph Layer to predict actions between instruments and tissues. In this layer, an interaction graph is built to represent the instruments and tissues and the relationships between them.
  • Figure 5: Illustration of making the inter-frame connection for the instance $o$ between the key frame and one reference frame $r$. $N_r$ denotes the number of objects in frame $r$ that have the same class as $o$.
  • ...and 1 more figures
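
The captions for Figures 4 and 5 describe an interaction graph with intra-frame connections (instruments to tissues in the same frame) and inter-frame connections (the same instance across frames, chosen among the $N_r$ same-class objects in a reference frame). The sketch below builds such an edge list in plain Python. The captions do not say how the matching object is selected among the same-class candidates, so the highest-IoU rule used here is purely an assumption, as are the `Node` type, the function names, and the example labels.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


@dataclass(frozen=True)
class Node:
    frame: int   # frame index within the snippet
    kind: str    # "instrument" or "tissue"
    label: str   # class name
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def intra_frame_edges(nodes: List[Node]) -> List[Tuple[int, int]]:
    """Connect every instrument to every tissue detected in the same frame."""
    return [
        (i, j)
        for i, a in enumerate(nodes)
        for j, b in enumerate(nodes)
        if a.frame == b.frame and a.kind == "instrument" and b.kind == "tissue"
    ]


def inter_frame_edges(nodes: List[Node], key_frame: int) -> List[Tuple[int, int]]:
    """Link each key-frame object to one same-class object per reference frame.

    Assumption: among the same-class candidates in a reference frame,
    the one with the highest IoU is treated as the same instance.
    """
    ref_frames = sorted({n.frame for n in nodes if n.frame != key_frame})
    edges = []
    for i, o in enumerate(nodes):
        if o.frame != key_frame:
            continue
        for r in ref_frames:
            candidates = [j for j, n in enumerate(nodes)
                          if n.frame == r and n.label == o.label]
            if candidates:
                best = max(candidates, key=lambda j: iou(o.box, nodes[j].box))
                edges.append((i, best))
    return edges


# Hypothetical detections: frame 0 is a reference frame, frame 1 is the key frame.
nodes = [
    Node(0, "instrument", "forceps", (0.0, 0.0, 10.0, 10.0)),
    Node(0, "instrument", "forceps", (50.0, 50.0, 60.0, 60.0)),
    Node(0, "tissue", "cornea", (0.0, 0.0, 100.0, 100.0)),
    Node(1, "instrument", "forceps", (1.0, 1.0, 11.0, 11.0)),
    Node(1, "tissue", "cornea", (0.0, 0.0, 100.0, 100.0)),
]
intra = intra_frame_edges(nodes)
inter = inter_frame_edges(nodes, key_frame=1)
```

In this toy snippet the key-frame forceps has two same-class candidates in the reference frame and is linked to the overlapping one. A learned or feature-based matching rule could be substituted for the IoU heuristic without changing the graph structure.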