Surgical Triplet Recognition via Diffusion Model

Daochang Liu, Axel Hu, Mubarak Shah, Chang Xu

TL;DR

DiffTriplet is a new generative framework for surgical triplet recognition that uses a diffusion model to predict surgical triplets via iterative denoising. It introduces two designs within this diffusion framework: association learning and association guidance.

Abstract

Surgical triplet recognition is an essential building block for next-generation context-aware operating rooms. The goal is to identify the combinations of instruments, verbs, and targets present in surgical video frames. In this paper, we propose DiffTriplet, a new generative framework for surgical triplet recognition that uses a diffusion model to predict surgical triplets via iterative denoising. To handle the challenge of triplet association, we propose two designs within our diffusion framework: association learning and association guidance. During training, we optimize the model in the joint space of triplets and individual components to capture the dependencies among them. At inference, we integrate association constraints into each update of the iterative denoising process, which refines the triplet prediction using information from the individual components. Experiments on the CholecT45 and CholecT50 datasets show that the proposed method achieves new state-of-the-art performance for surgical triplet recognition. Our code will be released.

Paper Structure (10 sections, 8 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Illustration of our DiffTriplet framework. During training, the model learns to denoise noisy sequences in the joint space of triplets and individual components. During inference, the model iteratively generates the prediction by gradually reducing the noise, starting from a pure noise sequence. In the joint space matrices $\mathrm{x}^{\mathtt{J}}, \bar{\mathrm{x}}^{\mathtt{J}}$, the blue rows are triplets $\mathrm{x}^{\mathtt{IVT}}, \bar{\mathrm{x}}^{\mathtt{IVT}}$, the red rows are instruments $\mathrm{x}^{\mathtt{I}}, \bar{\mathrm{x}}^{\mathtt{I}}$, the green rows are verbs $\mathrm{x}^{\mathtt{V}}, \bar{\mathrm{x}}^{\mathtt{V}}$, and the purple rows are targets $\mathrm{x}^{\mathtt{T}}, \bar{\mathrm{x}}^{\mathtt{T}}$. Darker colors mean higher probabilities. The model $f_\theta$ is a causal temporal model.
  • Figure 2: Visualization of the dependency matrix $M_{\mathtt{I}} \in \{0,1\}^{C_{\mathtt{I}} \times C_{\mathtt{IVT}}}$. Yellow cells (value 1) mark possible associations between a triplet and an instrument; blue cells (value 0) mark impossible ones.
  • Figure 3: Visualization of the dependency matrix $M_{\mathtt{V}} \in \{0,1\}^{C_{\mathtt{V}} \times C_{\mathtt{IVT}}}$. Yellow cells (value 1) mark possible associations between a triplet and a verb; blue cells (value 0) mark impossible ones.
  • Figure 4: Visualization of the dependency matrix $M_{\mathtt{T}} \in \{0,1\}^{C_{\mathtt{T}} \times C_{\mathtt{IVT}}}$. Yellow cells (value 1) mark possible associations between a triplet and a target; blue cells (value 0) mark impossible ones.
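
The dependency matrices in Figures 2 to 4 can be illustrated with a small sketch. The vocabulary below is a hypothetical toy subset, not the actual CholecT45/CholecT50 label set, and the helper names `dependency_matrix` and `project` are introduced here for illustration only. The max-based aggregation in `project` is one simple way to relate triplet scores to component scores; it is an assumption, not necessarily the rule used in the paper's association guidance.

```python
# Hypothetical toy vocabulary; the real datasets have many more classes.
INSTRUMENTS = ["grasper", "hook", "clipper"]
VERBS = ["grasp", "dissect", "clip"]
TARGETS = ["gallbladder", "cystic_duct"]

# Each triplet class is one (instrument, verb, target) combination.
TRIPLETS = [
    ("grasper", "grasp", "gallbladder"),
    ("hook", "dissect", "gallbladder"),
    ("hook", "dissect", "cystic_duct"),
    ("clipper", "clip", "cystic_duct"),
]


def dependency_matrix(components, triplets, position):
    """Build a binary matrix of shape (len(components), len(triplets)).

    Entry [c][k] is 1 when triplet class k contains component c at the
    given position (0 = instrument, 1 = verb, 2 = target), else 0.
    """
    return [
        [1 if triplet[position] == component else 0 for triplet in triplets]
        for component in components
    ]


def project(matrix, triplet_probs):
    """Aggregate triplet probabilities into per-component scores.

    Assumption for illustration: a component's score is the maximum
    probability among the triplets it can be associated with.
    """
    return [
        max((p for m, p in zip(row, triplet_probs) if m), default=0.0)
        for row in matrix
    ]


M_I = dependency_matrix(INSTRUMENTS, TRIPLETS, 0)  # C_I x C_IVT
M_V = dependency_matrix(VERBS, TRIPLETS, 1)        # C_V x C_IVT
M_T = dependency_matrix(TARGETS, TRIPLETS, 2)      # C_T x C_IVT

# Example triplet probabilities for a single frame (made-up numbers).
triplet_probs = [0.1, 0.7, 0.2, 0.05]
instrument_scores = project(M_I, triplet_probs)

for name, row in zip(INSTRUMENTS, M_I):
    print(name, row)
print("instrument scores:", instrument_scores)
```

Because every triplet contains exactly one instrument, one verb, and one target, each column of these matrices holds a single 1, which is why the visualized matrices are sparse.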