Masked Latent Transformer with the Random Masking Ratio to Advance the Diagnosis of Dental Fluorosis

Yun Wu, Hao Xu, Maohua Gu, Zhongchuan Jiang, Jun Xu, Youliang Tian

TL;DR

This work tackles the automated diagnosis of dental fluorosis by addressing data scarcity with the first open-source DFID dataset and introducing MLTrMR, a masked latent transformer that leverages a random masking ratio to enhance context learning of fluorosis lesions. The model integrates a latent embedder, encoder, and decoder within a Vision Transformer framework, augmented by latent tokens, adaptive normalization, and relative positional biases; training is guided by an auxiliary loss that aligns decoded features with the original image. Empirical results show that MLTrMR achieves state-of-the-art performance on DFID, with accuracy $80.19\%$, F1 $75.79\%$, and qwKappa $81.28\%$, and ablations confirm the importance of the latent embedder, random masking, auxiliary loss, and RelPos-MSA. The dataset and method collectively advance non-invasive dental fluorosis diagnosis and offer a pathway toward scalable, automated screening in regions with high fluorosis prevalence.

Abstract

Dental fluorosis is a chronic disease caused by long-term overconsumption of fluoride, which leads to changes in the appearance of tooth enamel. It is an important basis for early non-invasive diagnosis of endemic fluorosis. However, even dental professionals may not be able to accurately distinguish dental fluorosis and its severity based on tooth images. Currently, there is still a gap in research on applying deep learning to diagnosing dental fluorosis. Therefore, we construct the first open-source dental fluorosis image dataset (DFID), laying the foundation for deep learning research in this field. To advance the diagnosis of dental fluorosis, we propose a pioneering deep learning model called masked latent transformer with the random masking ratio (MLTrMR). MLTrMR introduces a mask latent modeling scheme based on Vision Transformer to enhance contextual learning of dental fluorosis lesion characteristics. Consisting of a latent embedder, encoder, and decoder, MLTrMR employs the latent embedder to extract latent tokens from the original image, whereas the encoder and decoder comprising the latent transformer (LT) block are used to process unmasked tokens and predict masked tokens, respectively. To mitigate the lack of inductive bias in Vision Transformer, which may result in performance degradation, the LT block introduces latent tokens to enhance the learning capacity of latent lesion features. Furthermore, we design an auxiliary loss function to constrain the parameter update direction of the model. MLTrMR achieves 80.19% accuracy, 75.79% F1, and 81.28% quadratic weighted kappa on DFID, making it state-of-the-art (SOTA).
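
The quadratic weighted kappa reported above measures agreement on ordinal labels and penalizes a prediction more heavily the further it lies from the true grade. The following is a minimal sketch of the standard formula, not the authors' evaluation code; the function name `quadratic_weighted_kappa` and the encoding of the four severity grades (normal, mild, moderate, severe) as integers 0 to 3 are assumptions made for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Cohen's kappa with quadratic weights for ordinal labels 0..num_classes-1.

    Illustrative sketch only; label encoding is an assumption, not from the paper.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    n = len(y_true)
    k = num_classes

    # Observed confusion counts: rows are true grades, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: 0 on the diagonal, 1 for the largest possible gap.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected counts if true and predicted grades were independent.
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n

    if weighted_expected == 0.0:
        # Both label sets use a single identical class; treated here as
        # perfect agreement (a convention chosen for this sketch).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

With this weighting, confusing severe with moderate costs far less than confusing severe with normal, which is why the metric suits severity grading better than plain accuracy.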

Paper Structure (26 sections, 18 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Comparison between normal teeth and different severity degrees of dental fluorosis. (a) Normal. (b) Mild. (c) Moderate. (d) Severe. We invited 20 non-physicians (without medical or dental backgrounds), 10 non-dentists, and 10 dentists to manually assess images (b), (c), and (d). Our statistics in (e) and (f) reveal the challenge of distinguishing between dental fluorosis and normal teeth. Non-physicians find it difficult to tell them apart, and non-dentists struggle to differentiate accurately between mild and moderate dental fluorosis. Notably, even dentists may not provide a completely accurate diagnosis.
  • Figure 2: Example of dental fluorosis. (a), (b), (c), and (d) show normal teeth, mild dental fluorosis, moderate dental fluorosis, and severe dental fluorosis, respectively.
  • Figure 3: Data proportion statistics. (a) Distribution of age stages in participants. (b) Distribution of severity of dental fluorosis images.
  • Figure 4: The overall structure of MLTrMR. (a) is the structure of MLTrMR. The embedding process before the encoder is similar to ViT. However, in our model, the masking operation is performed with a random ratio during training but is unnecessary during inference. The latent tokens produced by the latent embedder and the image tokens are fed into the latent transformer (LT) block. (b) is the LT block, which replaces the standard layer normalization layer in the Transformer block with adaptive layer normalization (adaLN), and regresses the scale and shift parameters $\gamma$ and $\beta$, as well as the dimension-wise scaling parameter $\alpha$ via the latent tokens. We design the relative position multi-head self-attention (RelPos-MSA) to replace the multi-head self-attention (MSA).
  • Figure 5: Structure of the latent embedder. The CNN backbone network extracts latent features from the original image, followed by dimensionality reduction through adaptive average pooling. The embedding layer then acquires latent tokens.
  • ...and 4 more figures
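
The Figure 4 caption notes that masking uses a random ratio during training and is skipped at inference. A minimal sketch of that idea follows; the function name `random_ratio_mask`, the uniform sampling, and the 0.25 to 0.75 ratio range are assumptions for illustration, since the captions above do not state how the ratio is drawn.

```python
import random


def random_ratio_mask(num_tokens, rng, min_ratio=0.25, max_ratio=0.75, training=True):
    """Split token indices into (kept, masked) using a freshly sampled ratio.

    Hypothetical helper: the ratio range and uniform sampling are assumptions.
    """
    if not training:
        # Inference: every token is passed to the encoder, nothing is masked.
        return list(range(num_tokens)), []

    # Draw a new masking ratio for each call rather than fixing one value.
    ratio = rng.uniform(min_ratio, max_ratio)
    num_masked = int(round(ratio * num_tokens))

    # Shuffle the indices and mask the first num_masked of them.
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


# Example: 16 image tokens, seeded generator for reproducibility.
kept, masked = random_ratio_mask(16, random.Random(0))
```

In a setup like the one described, the encoder would see only the `kept` tokens while the decoder predicts the `masked` ones; varying the ratio exposes the model to both lightly and heavily occluded views of a lesion.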