Innovative Tooth Segmentation Using Hierarchical Features and Bidirectional Sequence Modeling

Xinxin Zhao, Jian Jiang, Yan Tian, Liqin Wu, Zhaocheng Xu, Teddy Yang, Yunuo Zou, Xun Wang

TL;DR

The paper proposes a three-stage encoder with hierarchical feature representation to capture scale-adaptive information in dental images, together with a bidirectional sequence modeling strategy that enhances global spatial context understanding without incurring high computational cost.

Abstract

Tooth image segmentation is a cornerstone of dental digitization. However, traditional image encoders relying on fixed-resolution feature maps often lead to discontinuous segmentation and poor discrimination between target regions and background, due to insufficient modeling of environmental and global context. Moreover, transformer-based self-attention introduces substantial computational overhead because of its quadratic complexity (O(n^2)), making it inefficient for high-resolution dental images. To address these challenges, we introduce a three-stage encoder with hierarchical feature representation to capture scale-adaptive information in dental images. By jointly leveraging low-level details and high-level semantics through cross-scale feature fusion, the model effectively preserves fine structural information while maintaining strong contextual awareness. Furthermore, a bidirectional sequence modeling strategy is incorporated to enhance global spatial context understanding without incurring high computational cost. We validate our method on two dental datasets, with experimental results demonstrating its superiority over existing approaches. On the OralVision dataset, our model achieves a 1.1% improvement in mean intersection over union (mIoU).
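
The abstract reports its headline result in mean intersection over union (mIoU). As a rough illustration of how that metric is commonly computed, here is a minimal sketch in plain Python. The function names `iou` and `mean_iou`, the toy label maps, and the choice to average over classes and skip classes absent from both maps are assumptions made for this example; the paper's exact evaluation protocol is not given in this excerpt.

```python
def iou(pred, target, cls):
    """IoU for one class over two equally sized 2D label maps (lists of lists)."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred, in_target = (p == cls), (t == cls)
            inter += in_pred and in_target   # pixel labeled cls in both maps
            union += in_pred or in_target    # pixel labeled cls in either map
    return inter / union if union else None  # None: class absent from both maps


def mean_iou(pred, target, num_classes):
    """Average the per-class IoU, skipping classes absent from both maps."""
    scores = [iou(pred, target, c) for c in range(num_classes)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Toy 2x3 label maps: 0 = background, 1 = tooth (hypothetical labels).
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 1],
          [0, 0, 0]]

print(iou(pred, target, 0))       # background: 3 shared pixels out of 4 in the union
print(iou(pred, target, 1))       # tooth: 2 shared pixels out of 3 in the union
print(mean_iou(pred, target, 2))  # mean of the two per-class values
```

In this toy case a single misclassified pixel lowers the IoU of both classes, which is why the metric is sensitive to the discontinuous masks the abstract describes.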

Paper Structure (19 sections, 7 equations, 8 figures, 7 tables, 1 algorithm)

Figures (8)

  • Figure 1: Comparison between our method and the advanced segmentation approach HQ-SAM on the dental segmentation dataset, in terms of architectural design. Across input resolutions, our method combines optimized feature extraction strategies with an efficient architecture to maintain low latency while delivering high-quality segmentation results. The blue areas show the generated segmentation masks, and the white dashed boxes highlight where the masks from our method and HQ-SAM differ.
  • Figure 2: Overall structure of the proposed approach, which adopts a classic encoder-decoder architecture. First, the input dental images undergo three-stage downsampling to capture image features at multiple scales. These features are then combined with the prompt vectors produced by the prompt encoder and processed by the mask decoder to generate segmentation masks for the original images. In the bidirectional sequence block (BSB), the sigmoid linear unit (SiLU) activation function combines the smoothness of the sigmoid with a linear component to enhance feature representation. The blue region indicates the generated segmentation mask.
  • Figure 3: Comparison between the vanilla Mamba block and the bidirectional sequence block. (a) The vanilla Mamba block scans the sequence in a single order, from start to finish. (b) The bidirectional sequence block scans in both forward and backward order. The green and purple dots represent patches at different positions, and the differently colored arrows indicate the scanning orders.
  • Figure 4: (a) Frames per second (FPS) of our method compared with SAM, HQ-SAM, and EfficientSAM across different input image sizes. (b) Quantitative evaluation of the impact of low-level detailed feature (LDF) aggregation on the model's mIoU. (c) mIoU of our method and SAM under Gaussian noise (standard deviation of 25). (d) mIoU trends of our method and SAM under random rotation (angles from -30° to 30°).
  • Figure 5: Effect of aggregating low-level detailed features (LDF). Aggregating LDF effectively mitigates noise from artifacts such as dental calculus and food residues, producing segmentation masks that are closer to the ground truth (GT). The blue regions show the generated segmentation masks, and the white dashed boxes highlight the differences in the segmentation results.
  • ...and 3 more figures
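
The forward and backward scanning contrasted in Figure 3 can be illustrated with a toy example. The sketch below is not the paper's bidirectional sequence block: it replaces the selective state-space update of a Mamba block with a fixed scalar recurrence, and it merges the two directions by summation, which is an assumption, since this excerpt does not say how the outputs are combined. The names `scan`, `bidirectional_scan`, and `decay` are hypothetical.

```python
def scan(xs, decay=0.5):
    """Causal linear recurrence: h[t] = decay * h[t-1] + xs[t].

    Each output depends only on the current and earlier positions,
    and the cost grows linearly with the sequence length.
    """
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def bidirectional_scan(xs, decay=0.5):
    """Run the scan forward and backward, then sum the two per position.

    Summation is an assumed merge rule, chosen only for illustration.
    """
    forward = scan(xs, decay)
    backward = scan(xs[::-1], decay)[::-1]  # scan from the end, then realign
    return [f + b for f, b in zip(forward, backward)]


# A flattened sequence of four patches where only the last one is "active".
patches = [0.0, 0.0, 0.0, 1.0]

print(scan(patches))                # [0.0, 0.0, 0.0, 1.0]
print(bidirectional_scan(patches))  # [0.125, 0.25, 0.5, 2.0]
```

With the single-direction scan, the first three positions receive no information about the last patch. With the backward pass added, every position sees it, at a cost that is still linear in the number of patches. Note that under this summation rule each position's own input is counted once per direction.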