Hierarchical Context Transformer for Multi-level Semantic Scene Understanding

Luoying Hao, Yan Hu, Yang Yue, Li Wu, Huazhu Fu, Jinming Duan, Jiang Liu

TL;DR

The paper tackles the lack of holistic, hierarchical understanding in surgical scene analysis by formulating multi-level semantic scene understanding (MSSU), a task set that jointly covers phase ($p$), step ($s$), and fine-grained actions/instruments ($a$, $t$), and by proposing a Hierarchical Context Transformer (HCT) to solve it. Core innovations include the Hierarchical Relation Aggregation Module (HRAM), which fuses cross-task information; Inter-task Contrastive Learning (ICL), which aligns task representations; and the spatial-temporal adapters (ST-Ada) in HCT+, which enable efficient transfer learning. Experiments on a private cataract dataset and the public PSI-AVA dataset report state-of-the-art performance across phase, step, instrument, and action tasks, with ablations supporting the contributions of HRAM and ICL and the efficiency of the adapters. The approach offers an end-to-end framework for comprehensive surgical scene understanding, supporting context-aware computer-assisted systems and clinical analytics, while ST-Ada keeps the number of tunable parameters low.

Abstract

A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide a systematic analysis that enables hierarchical surgical scene understanding. In this work, we propose to represent the task set [phase recognition --> step recognition --> action and instrument detection] as multi-level semantic scene understanding (MSSU). To this end, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across tasks at different levels. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features by absorbing complementary information from other tasks. Furthermore, considering the computational cost of the transformer, we propose HCT+, which integrates spatial and temporal adapters to achieve competitive performance with substantially fewer tunable parameters. Extensive experiments on our cataract dataset and the publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, consistently exceeding state-of-the-art methods by a large margin. The code is available at https://github.com/Aurora-hao/HCT.
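
The abstract does not give the form of the ICL objective, so the following is only a minimal sketch of a generic InfoNCE-style contrastive loss between two sets of task features, not the paper's formulation. The function name `info_nce`, the temperature value, the feature sizes, and the choice of phase and step features as the contrasted pair are all assumptions made for illustration.

```python
import numpy as np


def info_nce(anchor, positive, temperature=0.1):
    """Generic InfoNCE loss (illustrative; not the paper's exact ICL loss).

    Row i of `anchor` is treated as matching row i of `positive`;
    every other row of `positive` acts as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative mean log-probability of the matching (diagonal) pairs.
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
# Hypothetical features for 8 clips, 16 dimensions each.
phase_feat = rng.normal(size=(8, 16))
step_aligned = phase_feat + 0.01 * rng.normal(size=(8, 16))  # nearly identical
step_random = rng.normal(size=(8, 16))                       # unrelated

loss_aligned = info_nce(phase_feat, step_aligned)
loss_random = info_nce(phase_feat, step_random)
print(loss_aligned < loss_random)
```

In this sketch the loss is lower when the two task features for the same clip agree, which is the general mechanism by which a contrastive term can encourage one task's representation to absorb information from another.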

Paper Structure

This paper contains 24 sections, 8 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Comparison of our proposed method with existing methods. MSSU: multi-level semantic scene understanding in surgery.
  • Figure 2: The pipeline of our hierarchical context transformer (HCT) framework. Given an input video with $l+1$ frames, HCT uses a transformer model to extract shared features, which are then fed into the hierarchical relation aggregation module (HRAM) to capture the relations between the four task-wise features. After that, inter-task contrastive learning (ICL) is utilized to further optimize the HCT. For HCT+, the extended version of the model, we add a temporal adapter before HRAM and place the spatial adapter in the feed-forward module.
  • Figure 3: Detailed structure of the hierarchical relation aggregation module (HRAM).
  • Figure 4: Detailed structure of the inter-task contrastive learning (ICL).
  • Figure 5: Detailed structure of the spatial-temporal adapter in our proposed transformer block.
  • ...and 4 more figures
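
The Figure 2 and Figure 5 captions mention spatial and temporal adapters but do not describe their internals. As a rough illustration of why adapters reduce the number of tunable parameters, here is a sketch of a common bottleneck adapter design (down-projection, non-linearity, up-projection, residual connection). The class name `BottleneckAdapter`, the hidden size of 768, the bottleneck size of 64, and the zero initialisation of the up-projection are assumptions, not details taken from the paper.

```python
import numpy as np


class BottleneckAdapter:
    """Generic bottleneck adapter (illustrative; not the paper's ST-Ada)."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(scale=0.02, size=(dim, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Zero-initialised up-projection: the adapter starts as an identity map,
        # so inserting it does not disturb the pretrained backbone at step 0.
        self.w_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, x):
        hidden = np.maximum(x @ self.w_down + self.b_down, 0.0)  # ReLU
        return x + hidden @ self.w_up + self.b_up                # residual

    def num_params(self):
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))


rng = np.random.default_rng(0)
dim, bottleneck = 768, 64          # assumed sizes, for illustration only
adapter = BottleneckAdapter(dim, bottleneck, rng)

tokens = rng.normal(size=(4, 10, dim))   # (batch, tokens, channels)
out = adapter(tokens)

adapter_params = adapter.num_params()
# Parameter count of a standard transformer feed-forward block with 4x expansion.
ffn_params = 2 * dim * 4 * dim + 4 * dim + dim
print(adapter_params, ffn_params)
```

Under these assumed sizes the adapter holds roughly 2% of the parameters of a single feed-forward block, which illustrates how training only adapters while freezing the backbone can shrink the tunable parameter count.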