
Teeth-SEG: An Efficient Instance Segmentation Framework for Orthodontic Treatment based on Anthropic Prior Knowledge

Bo Zou, Shaofeng Wang, Hao Liu, Gaoyue Sun, Yajie Wang, FeiFei Zuo, Chengbin Quan, Youjian Zhao

TL;DR

TeethSEG addresses the challenge of 2D intraoral tooth instance segmentation by introducing a ViT-based framework with Multi-Scale Aggregation blocks and an Anthropic Prior Knowledge layer. It couples a pretrained CLIP backbone with learnable tooth-id tokens, enhanced by a cross/self-gating mechanism and a permutation-based upscaler to produce sharp, tooth-aware segmentation masks. The APK layer injects orthodontist-inspired priors to improve labeling in complex dentitions, particularly with missing teeth. The IO150K dataset, built via a human–machine hybrid annotation workflow, supports robust evaluation and demonstrates that TeethSEG outperforms state-of-the-art 2D segmentation methods and generalizes to out-of-distribution and RGB-domain data, highlighting its potential for large-scale orthodontic screening and self-inspection applications.

Abstract

Teeth localization, segmentation, and labeling in 2D images have great potential in modern dentistry to enhance dental diagnostics, treatment planning, and population-based studies on oral health. However, general instance segmentation frameworks fall short because of 1) the subtle differences between the shapes of some teeth (e.g., maxillary first premolar and second premolar), 2) the variation in tooth position and shape across subjects, and 3) the presence of abnormalities in the dentition (e.g., caries and edentulism). To address these problems, we propose a ViT-based framework named TeethSEG, which consists of stacked Multi-Scale Aggregation (MSA) blocks and an Anthropic Prior Knowledge (APK) layer. Specifically, to compose the two modules, we design 1) a unique permutation-based upscaler that ensures high efficiency while establishing clear segmentation boundaries, and 2) multi-head self/cross-gating layers that emphasize particular semantics while maintaining the divergence between token embeddings. In addition, we collect 3) the first open-sourced intraoral image dataset, IO150K, which comprises over 150k intraoral photos, all annotated by orthodontists using a human-machine hybrid algorithm. Experiments on IO150K demonstrate that TeethSEG outperforms state-of-the-art segmentation models on dental image segmentation.
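
The abstract does not spell out how the permutation-based upscaler works. As a rough illustration only, the sketch below assumes it resembles a depth-to-space (pixel-shuffle) rearrangement, where extra channels are permuted into a finer spatial grid without any interpolation. The function name `depth_to_space`, the upscale factor `r`, and the toy input are hypothetical and are not taken from the paper.

```python
from typing import List

Grid = List[List[float]]  # one channel: H rows of W values


def depth_to_space(channels: List[Grid], r: int) -> List[Grid]:
    """Rearrange C*r*r channels of size HxW into C channels of size (H*r)x(W*r).

    Assumed mapping (pixel-shuffle convention, not confirmed by the paper):
        out[c][y*r + i][x*r + j] = in[c*r*r + i*r + j][y][x]
    Values are only moved, never averaged or interpolated.
    """
    if r < 1 or not channels or len(channels) % (r * r) != 0:
        raise ValueError("channel count must be a positive multiple of r*r")
    h, w = len(channels[0]), len(channels[0][0])
    out_c = len(channels) // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(out_c)]
    for c in range(out_c):
        for i in range(r):          # vertical offset inside each r x r cell
            for j in range(r):      # horizontal offset inside each r x r cell
                src = channels[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Toy input: 4 channels of 2x2, where value = 4*channel + 2*row + col.
feat = [[[float(4 * k + 2 * y + x) for x in range(2)] for y in range(2)]
        for k in range(4)]
up = depth_to_space(feat, r=2)  # one channel of 4x4
```

Because the operation is a pure permutation, its cost is a single pass over the feature map, which is one plausible reading of the efficiency claim; the paper's actual design may differ.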

Paper Structure

This paper contains 18 sections, 7 equations, 9 figures, and 10 tables.

Figures (9)

  • Figure 1: The overview of TeethSEG. We utilize a pretrained encoder to project an intraoral image into a sequence of visual tokens, and a set of trainable class tokens to predict segmentation masks. The multi-scale aggregation (MSA) blocks efficiently aggregate the visual information into class tokens, and the anthropic prior knowledge (APK) layer injects human judgment into the mask prediction.
  • Figure 2: Illustrations of Cross-Attention and our Cross-Gating mechanisms.
  • Figure 3: Illustration of our human-machine hybrid data annotation process.
  • Figure 4: Examples of TeethSEG's segmentation results on IO150K RGB test split.
  • Figure 5: The trend of mIoU changes during the training process.
  • ...and 4 more figures
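
Figure 5 tracks mIoU (mean intersection over union), the standard segmentation metric. The minimal sketch below shows one common way to compute it from flat label lists. The averaging convention (skipping classes absent from both prediction and ground truth) is an assumption, and the function name `mean_iou` and the toy labels are hypothetical; the paper's evaluation code may differ.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over classes present in pred or target.

    Assumption: classes that appear in neither sequence are skipped
    rather than counted as IoU 0 or 1.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three labels (e.g. background plus two tooth ids).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

Here the three per-class values average to 0.5, so a single mislabeled region in a small class can move the metric noticeably.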