Artificial Intelligence to Assess Dental Findings from Panoramic Radiographs -- A Multinational Study

Yin-Chih Chelsea Wang, Tsao-Lun Chen, Shankeeth Vinayahalingam, Tai-Hsien Wu, Chu Wei Chang, Hsuan Hao Chang, Hung-Jen Wei, Mu-Hsiung Chen, Ching-Chang Ko, David Anssari Moin, Bram van Ginneken, Tong Xi, Hsiao-Cheng Tsai, Min-Huey Chen, Tzu-Ming Harry Hsu, Hye Chou

TL;DR

This study addresses the difficulty of interpreting dental panoramic radiographs with a three-stage AI system that localizes findings, classifies tooth indices, and post-processes the outputs into per-tooth, per-finding probabilities for 8 types of dental findings. Trained on a Netherlands data set and evaluated on an internal Netherlands test set plus external test sets from Brazil and Taiwan, the AI achieved a macro-averaged AUC-ROC of 96.2% across findings and matched or surpassed human readers on most metrics, while reducing reading time to about 1.55 seconds per image. The work demonstrates robust cross-national generalization, with strong performance on implants, root canal fillings, and crown/bridges, and points to settings where AI can augment clinical workflows in which time and accuracy are critical. It also provides a multinational benchmark, a detailed account of the methods, and human-vs-AI comparisons to inform integration into dental practice and future improvements in AI-assisted radiographic interpretation.

Abstract

Dental panoramic radiographs (DPRs) are widely used in clinical practice for comprehensive oral assessment but present challenges due to overlapping structures and time constraints in interpretation. This study aimed to establish a solid baseline for the AI-automated assessment of findings in DPRs by developing and evaluating an AI system and comparing its performance with that of human readers across multinational data sets. We analyzed 6,669 DPRs from three data sets (the Netherlands, Brazil, and Taiwan), focusing on 8 types of dental findings. The AI system combined object detection and semantic segmentation techniques for per-tooth finding identification. Performance metrics included sensitivity, specificity, and area under the receiver operating characteristic curve (AUC-ROC). AI generalizability was tested across data sets, and performance was compared with that of human dental practitioners. The AI system demonstrated comparable or superior performance to human readers, most notably +67.9% (95% CI: 54.0%-81.9%; p < .001) sensitivity for identifying periapical radiolucencies and +4.7% (95% CI: 1.4%-8.0%; p = .008) sensitivity for identifying missing teeth. The AI achieved a macro-averaged AUC-ROC of 96.2% (95% CI: 94.6%-97.8%) across 8 findings. AI agreements with the reference were comparable to inter-human agreements in 7 of 8 findings, the exception being caries (p = .024). The AI system demonstrated robust generalization across diverse imaging and demographic settings and processed images 79 times faster (95% CI: 75-82) than human readers. The AI system effectively assessed findings in DPRs, achieving performance on par with or better than human experts while significantly reducing interpretation time. These results highlight the potential for integrating AI into clinical workflows to improve diagnostic efficiency, accuracy, and patient management.
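
The metrics named in the abstract can be illustrated with a minimal sketch. It assumes per-tooth binary reference labels and AI probabilities for each finding, computes AUC-ROC as the probability that a positive case outranks a negative one, and takes the unweighted mean over findings as the macro average. The function names, the finding keys, and the toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence, Tuple


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def macro_auc(per_finding: Dict[str, Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted mean of the per-finding AUCs."""
    aucs = [auc_roc(labels, scores) for labels, scores in per_finding.values()]
    return sum(aucs) / len(aucs)


# Toy data, invented for illustration only (two hypothetical findings).
toy = {
    "finding_a": ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7]),
    "finding_b": ([1, 1, 0, 0], [0.8, 0.7, 0.3, 0.1]),
}
toy_macro_auc = macro_auc(toy)
toy_sens, toy_spec = sensitivity_specificity(*toy["finding_a"], threshold=0.5)
```

Because the macro average weights every finding equally, a rare finding influences the summary as much as a common one, which is the usual reason for preferring it when finding prevalence varies widely.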

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Workflow for Dental Panoramic Radiograph Collection and Assessment Study. Dental panoramic radiographs were collected from three geographic locations, totaling 6,669 studies: 5,245 from the Netherlands, 1,173 from Brazil, and 251 from Taiwan. The initial data repositories before exclusion were much larger, with "$\approx$" denoting their original sizes. Data from the Netherlands were divided into a training set for artificial intelligence (AI) training and validation, and a test set designated as the internal test set. Both the Brazil and Taiwan data sets were utilized exclusively as external test sets for evaluation. The AI system was trained and validated on 4,044 imaging cases from the Netherlands, and was tasked with analyzing all three test sets to assess its generalizability. Additionally, both AI and dentists read a randomly selected subset of the Taiwan data to facilitate a comparative analysis of performance.
  • Figure 2: Overview of the Data Modalities and Artificial Intelligence (AI) Workflow. This schematic illustrates the AI process flow, beginning with dental panoramic radiographs (DPRs) as the primary input and culminating in a detailed finding assessment that labels whether each of the 8 findings is present in each of the 32 teeth (hence 256 binary labels per image). The AI system operated through three main stages: dental finding localization, tooth index classification, and post-processing. For AI training, we utilized contour labels annotated by general dental practitioners on the Netherlands training set. Note that only positive finding labels are shown in the finding assessment in the workflow. Each positive finding is indicated by a finding type and a tooth number, which in this study was annotated using the FDI World Dental Federation notation (ISO 3950). The FDI notation categorizes teeth into four quadrants, each having eight teeth, resulting in a number ranging from 11 to 48.
  • Figure 3: Comparison of AI System Performance with Human Readers on Dental Finding Assessment in Taiwan. This figure presents the receiver operating characteristic (ROC) curves for the AI system alongside the performance of 4 human readers on the Taiwan* test subset. Each plot displays a specific dental finding, with the shaded vertical areas representing the 95% confidence intervals (CI) for sensitivity along the ROC curve. The AI system’s operating point, indicated by "×", was determined by maximizing the $\textrm{F}_2$ score on a held-out validation set, tailored for screening scenarios. Figure insets magnify the critical regions of interest within each graph, providing a detailed view of performance near the operating point.
  • Figure 4: AI Generalization Performance across Multinational Data Sets. This figure evaluates the AI's capability to generalize its performance across different geographic data sets, focusing on the assessment of DPRs. The operating point of the AI system was optimized to maximize the $\textrm{F}_2$ score on a held-out validation set, simulating a screening scenario. Each bar represents the AI's performance metric for a specific dental finding within a data set, with the 95% confidence intervals shown as error bars. Notably, the $y$-axes for some metrics do not start from zero to highlight specific performance ranges. Cohen's Kappa values among pairs of the human readers (G1, G2, S1, and S2) were computed and displayed alongside the AI's Kappa against the reference, serving as a contextual upper limit for AI performance. A comprehensive exploration of the inter-human-reader agreements is included in the Supplementary Material.
  • Figure 5: Inter-Reader Agreement Levels for Dental Finding Summaries in Taiwan. This figure presents the agreement levels between pairs of readers as measured by Cohen's Kappa for various dental findings. Each blue error bar illustrates the Kappa agreement between each of the six possible reader pairings (G1/G2, G1/S1, G1/S2, G2/S1, G2/S2, and S1/S2) from four participating readers, grouped into general dentists (G1 and G2) and specialists (S1 and S2). The agreement levels are averaged for pairs within the same expertise group (generalists or specialists) and across different expertise, shown as red lines. The error bars represent the 95% confidence intervals for the Kappa values. Significance testing using $t$-statistics revealed that differences in mean agreement levels for residual roots were statistically significant ($p = .039$), while other findings showed no significant differences.
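
The label layout in Figure 2 and the operating-point rule in Figures 3 and 4 can be sketched as follows. The sketch enumerates the 32 FDI tooth numbers, maps a (tooth, finding) pair to one of the 256 binary labels, and picks the threshold that maximizes the $\textrm{F}_2$ score on a validation set. The flat index ordering, the use of observed scores as candidate thresholds, the function names, and the toy data are all assumptions made for illustration; the captions do not specify these details.

```python
from typing import List, Sequence

N_FINDINGS = 8  # the study covers 8 finding types; they are indexed here, not named


def fdi_tooth_numbers() -> List[int]:
    """Permanent teeth in FDI notation: quadrant digit 1-4, position digit 1-8."""
    return [10 * quadrant + position
            for quadrant in range(1, 5)
            for position in range(1, 9)]


def label_index(tooth: int, finding: int) -> int:
    """Flat index of a (tooth, finding) label (assumed tooth-major ordering)."""
    return fdi_tooth_numbers().index(tooth) * N_FINDINGS + finding


def f_beta(labels: Sequence[int], scores: Sequence[float],
           threshold: float, beta: float = 2.0) -> float:
    """F-beta score when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_f2_threshold(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Threshold among the observed scores with the highest F2 (assumed search grid)."""
    return max(sorted(set(scores)), key=lambda t: f_beta(labels, scores, t))


# Toy validation data for one finding, invented for illustration only.
val_labels = [1, 1, 0, 1, 0, 0]
val_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
operating_threshold = best_f2_threshold(val_labels, val_scores)
```

With beta set to 2, a missed finding costs more than a false alarm, which matches the screening scenario the captions describe: the selected threshold tends to sit lower than one chosen by the balanced F1 score.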
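
The pairwise agreement reported in Figures 4 and 5 is Cohen's Kappa, which corrects the observed agreement between two readers for the agreement expected by chance. Below is a minimal sketch for binary per-tooth labels, together with the six reader pairings listed in the Figure 5 caption. The function name and the toy readings are hypothetical, and the study's own computation (including its confidence intervals) may differ.

```python
from itertools import combinations
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's Kappa for two readers' binary (0/1) labels on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both readers must label the same non-empty set of items")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    rate_a = sum(a) / n  # fraction of items reader A marks positive
    rate_b = sum(b) / n  # fraction of items reader B marks positive
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # both readers gave the same constant label to every item
    return (observed - expected) / (1 - expected)


# The four readers form six unordered pairs, as in the Figure 5 caption.
readers = ["G1", "G2", "S1", "S2"]
reader_pairs = list(combinations(readers, 2))

# Toy readings for one finding over eight teeth, invented for illustration only.
reading_a = [1, 1, 0, 0, 1, 0, 0, 0]
reading_b = [1, 0, 0, 0, 1, 0, 1, 0]
toy_kappa = cohen_kappa(reading_a, reading_b)
```

In the toy example the readers agree on 6 of 8 teeth (0.75), but because both mark most teeth negative, chance agreement is already about 0.53, so the Kappa is well below the raw agreement rate.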