April 11, 2026


Accuracy of artificial intelligence in orthodontic extraction treatment planning: a systematic review and meta analysis

Summary of findings

The purpose of this meta-analysis was to investigate the diagnostic performance of AI models for predicting dental extractions during orthodontic treatment planning. The pooled sensitivity and specificity of the eligible studies were 0.70 (95% CI: 0.61–0.78) and 0.90 (95% CI: 0.87–0.92), respectively, indicating moderate to high diagnostic accuracy. However, heterogeneity was substantial (I² = 96.7% for sensitivity and 93.7% for specificity, p < 0.001), indicating that the performance of AI models may vary with the study population and methodological approach.
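The pooled estimates and the I² statistic above follow standard meta-analytic formulas: proportions are combined on the logit scale with inverse-variance weights, and I² is derived from Cochran's Q. The sketch below illustrates the arithmetic with hypothetical per-study sensitivities, not the data from the reviewed studies:

```python
import math

# Hypothetical per-study sensitivities and numbers of extraction cases
# (illustrative only, not the studies pooled in this review).
sens = [0.60, 0.75, 0.82, 0.65, 0.90]
n_pos = [80, 120, 95, 60, 150]

# Logit-transform each proportion; the variance of a logit proportion
# is approximately 1/(n*p) + 1/(n*(1-p)).
logits = [math.log(p / (1 - p)) for p in sens]
variances = [1 / (n * p) + 1 / (n * (1 - p)) for p, n in zip(sens, n_pos)]
weights = [1 / v for v in variances]

# Inverse-variance pooled logit, back-transformed to a proportion.
pooled_logit = sum(w * x for w, x in zip(weights, logits)) / sum(weights)
pooled_sens = 1 / (1 + math.exp(-pooled_logit))

# Cochran's Q and the I² heterogeneity statistic.
Q = sum(w * (x - pooled_logit) ** 2 for w, x in zip(weights, logits))
df = len(sens) - 1
I2 = max(0.0, (Q - df) / Q) * 100

print(f"pooled sensitivity = {pooled_sens:.3f}, I2 = {I2:.1f}%")
```

An I² near 0% means the between-study spread is no larger than chance alone would produce, while values above 75% (as reported here) indicate considerable heterogeneity.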

Subgroup analysis by model type provided further insight into performance variability. CNN-based models demonstrated superior diagnostic performance, with both high sensitivity (0.758–0.824) and high specificity (0.931–0.941) and no observable heterogeneity. In contrast, MLP and RF models exhibited lower and more variable diagnostic performance, reflected both in their pooled estimates and in high heterogeneity.

Moreover, the meta-regression analysis showed that studies with a higher prevalence of extraction cases tended to report higher sensitivity. However, this may limit the generalizability of the models, as their performance could be biased toward over-represented outcomes.

In addition, the leave-one-out sensitivity analysis confirmed the robustness of the pooled sensitivity estimate in our meta-analysis.
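A leave-one-out analysis simply recomputes the pooled estimate with each study omitted in turn; if no single omission moves the estimate materially, the result is considered robust. A minimal sketch, again using hypothetical study values rather than the review's data:

```python
import math

def pooled_logit_sensitivity(sens, n_pos):
    """Inverse-variance pooled sensitivity on the logit scale (sketch)."""
    logits = [math.log(p / (1 - p)) for p in sens]
    weights = [1 / (1 / (n * p) + 1 / (n * (1 - p)))
               for p, n in zip(sens, n_pos)]
    pooled = sum(w * x for w, x in zip(weights, logits)) / sum(weights)
    return 1 / (1 + math.exp(-pooled))

# Hypothetical per-study sensitivities and extraction-case counts.
sens = [0.60, 0.75, 0.82, 0.65, 0.90]
n_pos = [80, 120, 95, 60, 150]

full = pooled_logit_sensitivity(sens, n_pos)
for i in range(len(sens)):
    reduced = pooled_logit_sensitivity(sens[:i] + sens[i + 1:],
                                       n_pos[:i] + n_pos[i + 1:])
    print(f"omit study {i + 1}: pooled sensitivity = {reduced:.3f} "
          f"(full = {full:.3f})")
```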

These findings highlight both the potential and limitations of AI applications in orthodontics, underscoring the need for further investigation.

Comparison with previous research

Given the pooled sensitivity of 70% and specificity of 90%, AI models appear capable of detecting dental extraction cases with an acceptable level of accuracy. These outcomes align with earlier research showing AI's effectiveness in orthodontic decision-making. For example, a study by Peilin et al. (2019) estimated the sensitivity and specificity of AI-based prediction models in orthodontic treatment planning to be 94.6% and 93.8%, respectively [20]. Similarly, Kuo et al. (2022) observed sensitivity and specificity rates of 92% and 91%, further supporting AI's diagnostic potential in clinical practice [33]. Trehan et al. (2023) evaluated a convolutional neural network (ResNet-50) trained on profile photographs and reported more modest accuracy in predicting extraction decisions: the model correctly identified 65.12% of extraction cases and 62.9% of non-extraction cases, illustrating the notable variability of AI performance in orthodontic treatment planning [26]. This variability underscores the need for standardized assessment protocols, validation frameworks, continuous training, and governance in orthodontic AI applications to improve model performance and ensure clinical generalizability [34].

Our results also align with a recent meta-analysis by Evangelista et al. [35], which pooled data from six studies and reported an overall accuracy of 0.87 (95% CI: 0.75–0.96) for AI in orthodontic extraction decision-making. However, the study noted very low certainty of evidence due to methodological flaws and risk of bias, issues also reflected in our analysis.

The high heterogeneity in our analysis (I² > 85%) reflects substantial differences in study design, population characteristics, and AI methodologies. For example, the study by Cirillo et al. [36] utilized a CNN (ResNet-50) and achieved a sensitivity of 74%, while Pal et al. [37] used a Random Forest algorithm and reported a sensitivity of 90.6%. Our subgroup analysis confirms that model type contributes significantly to this variability: CNN-based models, particularly ResNet and VGG, showed higher and more consistent diagnostic performance with minimal heterogeneity (I² = 0%). By contrast, models such as RF and MLP demonstrated greater variability in both sensitivity and specificity, coupled with high heterogeneity (I² > 88%).

Differences in training datasets also played a substantial role. Models trained on more balanced datasets, such as that used by Trehan et al. [26], performed differently from those trained on highly imbalanced data, such as the dataset used by Huang et al. [30], which contained a higher proportion of extraction cases. Our meta-regression results support this observation, revealing a significant association between higher extraction prevalence and increased sensitivity. This means that models trained on datasets where extraction cases are more common may learn to identify those cases more accurately. However, this could reduce the model's ability to perform well on datasets with different case distributions, which raises concerns about generalizability.
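The meta-regression described above amounts to a weighted least-squares fit of each study's (logit-transformed) sensitivity against its extraction prevalence, with inverse-variance weights. A minimal sketch with hypothetical (prevalence, sensitivity, sample-size) triples, chosen only to illustrate the positive slope, not taken from the included studies:

```python
import math

# Hypothetical (extraction prevalence, sensitivity, n) per study.
studies = [(0.25, 0.62, 80), (0.33, 0.68, 100), (0.40, 0.74, 90),
           (0.48, 0.80, 120), (0.54, 0.86, 110)]

x = [prev for prev, _, _ in studies]
y = [math.log(s / (1 - s)) for _, s, _ in studies]        # logit sensitivity
w = [1 / (1 / (n * s) + 1 / (n * (1 - s)))                # inverse variance
     for _, s, n in studies]

# Closed-form weighted least-squares slope and intercept.
sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
beta = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
        / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
alpha = ybar - beta * xbar

# A positive slope means higher prevalence is associated with
# higher (logit) sensitivity, mirroring the pattern reported above.
print(f"meta-regression slope on logit scale = {beta:.3f}")
```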

Population characteristics such as age and sex distribution also varied notably across studies and may have influenced model performance. For instance, Motmaen et al. [32] included a wide age range (11 to 99 years), potentially introducing anatomical diversity that could affect model accuracy. In contrast, other studies such as those by Etemad et al. [28] and Del Real et al. [29] focused on narrower age groups, primarily adolescents and young adults. Additionally, the absence of demographic data in several studies further complicates interpretation and may contribute to unexplained variance.

Finally, the variability observed across studies may also reflect differences in study quality, sample size, and adherence to methodological standards. As previously emphasized by Chaurasia et al. [38], comparing AI studies is challenging because of inconsistent reporting standards. Therefore, future studies should follow standardized guidelines for developing and evaluating AI models.

Visual inspection of the funnel plots revealed asymmetry for both sensitivity and specificity, suggesting potential publication bias. In addition, the wide spread of data points around the regression line suggests variability in effect sizes and may reflect differences in study quality or methodological rigor [39]. As Schulzke et al. [40] noted, reducing heterogeneity in meta-analyses requires high-quality, well-designed studies.
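Funnel-plot asymmetry is commonly quantified with Egger's regression test: the standardized effect (effect divided by its standard error) is regressed on precision (1 / standard error), and an intercept far from zero suggests small-study effects such as publication bias. A sketch with hypothetical effect sizes and standard errors, not the review's data:

```python
# Hypothetical logit-scale effects and their standard errors.
effects = [0.41, 0.85, 1.10, 0.30, 1.52, 0.95]
ses = [0.35, 0.22, 0.18, 0.40, 0.15, 0.25]

# Egger's regression: standardized effect vs. precision.
y = [e / s for e, s in zip(effects, ses)]
x = [1 / s for s in ses]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar

# An intercept well away from zero (judged against its standard
# error in a full implementation) indicates funnel asymmetry.
print(f"Egger intercept = {intercept:.3f}")
```

A full implementation would also compute the intercept's standard error and a t-test; this sketch shows only the point estimate.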

Strengths and limitations

The strengths of this meta-analysis include an extended search strategy across multiple databases (PubMed, Scopus, Web of Science, and Google Scholar), adherence to the PRISMA guidelines, and inclusion of studies from various geographical regions, which improves the generalizability of the results.

However, the limitations include the small number of included studies (n = 7), which limits statistical power and generalizability. Additionally, since the included studies were conducted in only a few countries, the findings may not be representative of broader populations. Significant heterogeneity was observed (I² > 85%), probably due to differences in study design, populations, and AI algorithms, which complicates the interpretation of the pooled results. Further, the lack of detailed demographic and clinical data restricted subgroup analyses, and the cross-sectional design of the included studies limits the ability to draw causal conclusions.

The high heterogeneity in AI model performance (I² = 96.7% for sensitivity, 93.7% for specificity) was driven by several factors. CNN-based models (ResNet, VGG) showed low heterogeneity (I² = 0%) and superior performance, while traditional methods (RF, MLP) showed high variability (I² > 88%), likely due to inconsistent feature extraction. Extraction prevalence (25–54%) and demographic differences (sex, age) influenced sensitivity, with meta-regression linking higher prevalence to increased sensitivity (β = 0.9923, p = 0.050). Despite moderate to high study quality (6/8–8/8), unclear reporting and funnel-plot asymmetry indicated potential publication bias, which likely inflated performance estimates. In short, CNN models were the most consistent, while dataset imbalance, design variability, and reporting gaps contributed substantially to heterogeneity. Standardized methodology and transparent reporting are needed to improve dependability.

Despite this heterogeneity, pooling remains appropriate because it quantifies the expected range of AI performance in real-world settings and highlights the need for standardization. The 95% prediction intervals for sensitivity and specificity illustrate this variability, emphasizing that future models may perform anywhere within these bounds depending on context.
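A prediction interval widens the confidence interval by the between-study variance tau², so it bounds where a *new* study's result is expected to fall rather than where the average lies. A sketch of the usual formula on the logit scale, using hypothetical summary values (the pooled logit, its standard error, and tau² here are illustrative assumptions, not the review's estimates):

```python
import math

# Hypothetical random-effects summary on the logit scale.
pooled_logit = 0.85   # pooled logit sensitivity (illustrative)
se_pooled = 0.18      # standard error of the pooled estimate
tau2 = 0.45           # between-study variance (tau squared)
k = 7                 # number of studies
t_crit = 2.571        # two-sided 95% t quantile with k - 2 = 5 df

# 95% prediction interval: pooled +/- t * sqrt(tau2 + SE^2),
# then back-transform each bound from logit to a proportion.
half_width = t_crit * math.sqrt(tau2 + se_pooled ** 2)
lo = 1 / (1 + math.exp(-(pooled_logit - half_width)))
hi = 1 / (1 + math.exp(-(pooled_logit + half_width)))

print(f"95% prediction interval for sensitivity: {lo:.2f} to {hi:.2f}")
```

Because tau² enters under the square root, a highly heterogeneous meta-analysis can have a tight confidence interval yet a very wide prediction interval, which is exactly the caution expressed above.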

Despite these limitations, this review provides valuable insights and highlights the need for more large-scale, standardized studies. Future research should follow open reporting practices and consistent methodologies to enhance the reliability and applicability of AI in orthodontic treatment planning.

Implications for clinical practice

This meta-analysis provides evidence that AI models can support orthodontists in making evidence-based decisions regarding dental extractions. However, due to the observed heterogeneity and variability in diagnostic performance, these findings should be applied with caution. AI predictions should serve as complementary tools rather than definitive decision-makers, especially in complex cases where clinical judgment remains crucial. The development of standardized training datasets and robust validation protocols will be central to ensuring consistent and reliable AI performance across diverse patient populations.

Future directions

The limitations identified in this study highlight key areas for future research. Large-scale, multicenter studies employing standardized methodologies are needed to determine the diagnostic performance of AI models in orthodontic treatment planning more accurately. Additionally, the adoption of transparent reporting frameworks, such as standardized guidelines designed for AI research, will support the development of more rigorous systematic reviews and meta-analyses. Further exploration of AI integration with advanced diagnostic tools, including 3D imaging and ML-based predictive analytics, may enhance clinical utility and improve decision-making in orthodontics.

