Real-Time Glottis Detection Framework via Spatial-decoupled Feature Learning for Nasal Transnasal Intubation

Jinyu Liu, Gaoyang Zhang, Yang Zhou, Ruoyi Hao, Yang Zhang, Hongliang Ren

TL;DR

Mobile GlottisNet is a lightweight, efficient glottis detection framework designed for real-time inference on embedded and edge devices. It uses a hierarchical dynamic thresholding strategy to improve sample assignment, and an adaptive feature decoupling module based on deformable convolution to support dynamic spatial reconstruction.
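The hierarchical dynamic thresholding strategy is only named here. The Figure 2 caption adds that the mean IoU of candidate boxes sets an adaptive threshold. The exact rule is not given, so the sketch below swaps in an ATSS-style assignment as an assumption: candidates are gathered per pyramid level (a guess at "hierarchical"), and the threshold is their mean IoU, optionally plus one standard deviation. ATSS itself ranks candidates by center distance; ranking by IoU here is a simplification. The names `iou`, `assign_positives`, `top_k`, and `use_std` are hypothetical, not the authors' code.

```python
from statistics import mean, pstdev

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_positives(gt: Box, anchors_per_level: list[list[Box]],
                     top_k: int = 3, use_std: bool = True):
    """Hypothetical ATSS-style assignment for one ground-truth box.

    Takes the top_k anchors by IoU on each pyramid level as candidates, then
    sets the threshold to their mean IoU (plus one std if use_std is True).
    """
    candidates = []  # (iou, level, index)
    for level, anchors in enumerate(anchors_per_level):
        scored = sorted(((iou(gt, a), level, i) for i, a in enumerate(anchors)),
                        reverse=True)
        candidates.extend(scored[:top_k])
    ious = [c[0] for c in candidates]
    threshold = mean(ious) + (pstdev(ious) if use_std else 0.0)
    positives = [(level, i) for score, level, i in candidates if score >= threshold]
    return threshold, positives


if __name__ == "__main__":
    gt = (0.0, 0.0, 10.0, 10.0)
    levels = [
        [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0), (20.0, 20.0, 30.0, 30.0)],
        [(0.0, 0.0, 20.0, 20.0), (10.0, 10.0, 30.0, 30.0)],
    ]
    print(assign_positives(gt, levels))                 # threshold ~0.683, positives [(0, 0)]
    print(assign_positives(gt, levels, use_std=False))  # threshold ~0.317, positives [(0, 0), (0, 1)]
```

Because the threshold is derived from each object's own candidate IoUs, a well-covered object gets a strict cutoff and a poorly covered one a looser cutoff. This is the motivation behind replacing a fixed IoU threshold.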

Abstract

Nasotracheal intubation (NTI) is a vital procedure in emergency airway management, where rapid and accurate glottis detection is essential to ensure patient safety. However, existing machine-assisted visual detection systems often rely on high-performance computational resources and suffer from significant inference delays, which limits their applicability in time-critical and resource-constrained scenarios. To overcome these limitations, we propose Mobile GlottisNet, a lightweight and efficient glottis detection framework designed for real-time inference on embedded and edge devices. The model incorporates structural awareness and spatial alignment mechanisms, enabling robust glottis localization under complex anatomical and visual conditions. We implement a hierarchical dynamic thresholding strategy to enhance sample assignment, and introduce an adaptive feature decoupling module based on deformable convolution to support dynamic spatial reconstruction. A cross-layer dynamic weighting scheme further facilitates the fusion of semantic and detail features across multiple scales. Experimental results on both our PID dataset and the Clinical datasets show that the model, only 5 MB in size, achieves inference speeds of over 62 FPS on terminal devices and 33 FPS on edge platforms, showing strong potential for application in emergency NTI.
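The abstract does not specify how the cross-layer dynamic weighting scheme computes its weights. For illustration, the sketch below swaps in BiFPN's fast normalized fusion as an assumption. A coarse, semantic map is upsampled to the size of a fine, detail map. Each input then gets a learnable scalar, clipped at zero and normalized so the weights sum to about 1. The paper's scheme may instead predict input-dependent weights. `upsample2x`, `weighted_fusion`, and `eps` are hypothetical names.

```python
def upsample2x(grid):
    """Nearest-neighbor 2x upsampling of a 2-D feature map given as a list of rows."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))  # repeat the row vertically (a copy, not an alias)
    return out


def weighted_fusion(maps, raw_weights, eps=1e-4):
    """Fuse same-sized 2-D maps with non-negative, normalized weights.

    raw_weights stand in for learnable scalars: ReLU keeps them >= 0 and the
    division makes them sum to (almost) 1, as in BiFPN's fast normalized fusion.
    """
    if len(maps) != len(raw_weights):
        raise ValueError("need one weight per input map")
    h, w = len(maps[0]), len(maps[0][0])
    if any(len(m) != h or any(len(r) != w for r in m) for m in maps):
        raise ValueError("all maps must share the same shape; resize them first")
    relu = [max(0.0, r) for r in raw_weights]
    total = sum(relu) + eps  # eps avoids division by zero if every weight is 0
    norm = [r / total for r in relu]
    fused = [
        [sum(n * m[y][x] for n, m in zip(norm, maps)) for x in range(w)]
        for y in range(h)
    ]
    return norm, fused


if __name__ == "__main__":
    detail = [[1.0, 1.0], [1.0, 1.0]]  # high-resolution, shallow-layer map
    semantic = [[4.0]]                 # low-resolution, deep-layer map
    weights, fused = weighted_fusion([detail, upsample2x(semantic)], [1.0, 3.0])
    print(weights, fused)  # weights close to [0.25, 0.75], every fused value close to 3.25
```

Normalizing with a plain sum rather than a softmax keeps the fusion cheap. This matters for the edge-device budget the abstract targets, though whether Mobile GlottisNet makes the same choice is not stated here.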

Paper Structure (25 sections, 5 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Comparison with state-of-the-art methods on the PID dataset. The vertical axis shows inference speed (FPS) measured on terminal devices, and the area of each bubble corresponds to model size. Our method achieves a favorable balance between AP50 and FPS while maintaining a compact model size.
  • Figure 2: Overview of the proposed Mobile GlottisNet. The input image is first processed by the backbone network to generate multi-scale feature maps, which are then fused through the FPN to produce enhanced multi-scale feature representations. Subsequently, the classification and regression tasks are decoupled, and dynamic label assignment strategies are applied to adjust anchor point allocation. The mean IoU of candidate boxes is calculated to set an adaptive threshold. Finally, convolutional kernel sampling positions are adjusted based on the Offset Field in the adaptive feature decoupling module, and the final detection results are derived from class scores and bounding box offsets.
  • Figure 3: Attention heatmaps throughout the network. Feature disentanglement progressively refines spatial attention, culminating in accurate focus on the anatomical glottic aperture.
  • Figure 4: The nasotracheal intubation robot system, which improves intubation accuracy and minimizes human error. It uses a fiberoptic bronchoscope with a bendable robotic arm and dynamically adjusts the insertion path through real-time feedback control.
  • Figure 5: Sample images from the Glottis dataset. The dataset includes high-speed videoendoscopic recordings from both healthy individuals and patients with laryngeal disorders, collected across different clinical settings and imaging devices.
  • ...and 2 more figures
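The Figure 2 caption states that the adaptive feature decoupling module adjusts convolutional kernel sampling positions according to an offset field. The sketch below shows that mechanism under the standard deformable convolution (DCNv1-style) formulation, as an assumption rather than the authors' exact module. It uses a single channel, a 3x3 kernel, no modulation, and offsets passed in directly, whereas a real network would predict them with a separate convolution. Each tap reads the input at p + p_k + Δp_k, using bilinear interpolation for fractional positions. `deform_conv3x3`, `bilinear`, and `zero_offsets` are hypothetical names.

```python
import math

# Offsets of a regular 3x3 kernel grid, listed row by row (tap index k = 0..8).
KERNEL_GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear(img, y, x):
    """Sample a 2-D map at fractional (y, x); pixels outside the map count as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                value += (1.0 - abs(y - yy)) * (1.0 - abs(x - xx)) * img[yy][xx]
    return value


def deform_conv3x3(img, weights, offsets):
    """Single-channel 3x3 deformable convolution (stride 1, zero padding).

    weights: 9 kernel taps; offsets[y][x]: 9 (dy, dx) shifts, one per tap.
    Each tap reads the input at p + p_k + delta_p_k instead of the fixed p + p_k.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, (ky, kx) in enumerate(KERNEL_GRID):
                oy, ox = offsets[y][x][k]
                acc += weights[k] * bilinear(img, y + ky + oy, x + kx + ox)
            out[y][x] = acc
    return out


def zero_offsets(h, w):
    """Offset field with no shifts: the layer then equals an ordinary 3x3 conv."""
    return [[[(0.0, 0.0)] * 9 for _ in range(w)] for _ in range(h)]


if __name__ == "__main__":
    img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    center_only = [0.0] * 4 + [1.0] + [0.0] * 4
    print(deform_conv3x3(img, center_only, zero_offsets(3, 3)))  # identical to img
    shifted = [[[(0.0, 0.5) if k == 4 else (0.0, 0.0) for k in range(9)]
                for _ in range(3)] for _ in range(3)]
    print(deform_conv3x3(img, center_only, shifted)[1][1])  # 5.5, halfway between 5 and 6
```

With a zero offset field the layer reduces to a regular convolution. Learned non-zero offsets let the receptive field bend toward irregular structures such as the glottic aperture.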