Evaluation of ChatGPT-4o and DeepSeek as tools for orthodontic health literacy in public dental education
In dentistry, AI applications are increasingly being explored for clinical decision support, patient education, and dental student training17. Large language models such as ChatGPT can complement traditional educational methods by serving as virtual tutors and interactive learning tools16. This study evaluated the accuracy, consistency, and response time of ChatGPT-4o and DeepSeek when answering bilingual multiple-choice questions on orthodontic health literacy. The comparative analysis reveals profound disparities in the domain-specific performance of ChatGPT-4o and DeepSeek. While the overall accuracy rates differed minimally in the English versions (90.4% vs. 88.0%, p = 0.247, Cohen’s d = 0.077) and showed a significant advantage for ChatGPT-4o in the Chinese versions (91.3% vs. 83.6%, p < 0.001, Cohen’s d = 0.234), subgroup analyses uncovered more striking variations. A particularly notable finding is DeepSeek’s significant underperformance in Group A (Basic Orthodontic Knowledge), where its accuracy (50.0% in English, 55.6% in Chinese) substantially lagged behind ChatGPT-4o’s (82.2%, 83.3%), with large effect sizes (English: p < 0.001, Cohen’s d = 0.720; Chinese: p < 0.001, Cohen’s d = 0.629). The performance dynamics reversed in Group C, where DeepSeek achieved perfect accuracy in English (100.0% vs. 90.0%, p = 0.002, Cohen’s d = 0.469) and maintained strong performance in Chinese (98.9% vs. 93.3%, p = 0.055). However, this superior performance was not consistent across all advanced domains. In the Chinese versions of Groups D and E, ChatGPT-4o significantly outperformed DeepSeek (Group D: 100.0% vs. 93.3%, p = 0.013, Cohen’s d = 0.373; Group E: 100.0% vs. 90.0%, p = 0.002, Cohen’s d = 0.465), suggesting potential limitations in DeepSeek’s consistency when handling complex Chinese medical content. This pattern of struggling with foundational knowledge while demonstrating competence in complex domains suggests potential knowledge fragmentation within DeepSeek’s architecture, aligning with the concept of “capability overhang”, where emergent abilities on complex tasks do not guarantee robustness on simpler, foundational ones18.
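The reported effect sizes are consistent with treating each item response as a binary (correct/incorrect) score and dividing the accuracy difference by the pooled Bernoulli standard deviation; the short sketch below reproduces the reported values under that reading. It is illustrative only, and the exact computation and group sizes are those specified in the Methods.

```python
import math

def cohens_d_from_accuracies(p1, p2):
    """Cohen's d for two accuracy proportions, treating each item response as a
    binary (0/1) score and pooling the two Bernoulli standard deviations.
    Illustrative sketch; assumes equal numbers of scored responses per group."""
    pooled_sd = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return (p1 - p2) / pooled_sd

# Group A, English versions: ChatGPT-4o 82.2% vs. DeepSeek 50.0%
print(round(cohens_d_from_accuracies(0.822, 0.500), 3))  # ~0.72, close to the reported 0.720

# Overall accuracy, English versions: 90.4% vs. 88.0%
print(round(cohens_d_from_accuracies(0.904, 0.880), 3))  # ~0.077, matching the reported value
```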
The wide confidence intervals for DeepSeek in Group A (39.3–60.7% in English) compared with the narrower ranges for ChatGPT-4o (72.7–89.5%), coupled with the large effect sizes, further underscore the substantial instability of DeepSeek’s performance on core concepts. This may reflect differences in how the model handles lexical recall versus contextual reasoning. Group A items relied on definition matching, which can be sensitive to prompt phrasing or vocabulary shifts, especially across languages19. In contrast, Groups B and C required reasoning and pattern recognition, which align better with DeepSeek’s architecture focused on multi-step understanding and dense attention20. These findings highlight the need to match task types with model strengths when applying LLMs in health education.
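The interval widths follow from the number of scored responses per group. As an illustration only, assuming roughly 90 scored responses per model per language in Group A (an assumption inferred from the reported percentages rather than stated in this section), an exact (Clopper–Pearson) interval yields bounds close to those reported:

```python
from scipy.stats import beta

def clopper_pearson(successes, trials, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion, in percent."""
    lower = beta.ppf(alpha / 2, successes, trials - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, trials - successes) if successes < trials else 1.0
    return 100 * lower, 100 * upper

# DeepSeek, Group A, English: 50.0% accuracy (assumed 45 of 90 scored responses)
print(clopper_pearson(45, 90))   # compare with the reported 39.3-60.7

# ChatGPT-4o, Group A, English: 82.2% accuracy (assumed 74 of 90 scored responses)
print(clopper_pearson(74, 90))   # compare with the reported 72.7-89.5
```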
The performance profiles suggest the models may have undergone divergent optimization pathways. ChatGPT-4o demonstrates high-floor performance, maintaining accuracy above 80% across all groups, indicative of robust generalization essential for clinical decision-support tools21. In contrast, DeepSeek exhibits a high-ceiling, low-floor profile, achieving perfect or near-perfect accuracy in Groups C, D, and E (90–100%) while struggling profoundly in Group A. This “all-or-nothing” pattern resembles models trained with emphasis on mastering complex, reasoning-intensive tasks at the potential expense of uniformly consolidating foundational knowledge22. The near-perfect scores in Group C (Treatment Decision-Making Capacity) for both models suggest that structured, scenario-based reasoning represents a shared strength of contemporary LLMs, possibly due to abundant similar formats in their training data.
Language condition did not significantly affect accuracy for ChatGPT-4o across groups, suggesting strong bilingual generalization. In contrast, DeepSeek showed significantly lower accuracy in Group D and Group E in the Chinese version (Table 3), indicating potential limitations in cross-linguistic generalization for complex or context-dependent content. This discrepancy may stem from imbalances in training data distribution or less extensive pretraining in non-English medical corpora. While previous studies have highlighted the multilingual strengths of ChatGPT-based models23, evidence also suggests that non-English performance of newer domestic models such as DeepSeek may be more sensitive to domain complexity and language-specific semantics24. This pattern is consistent with prior cross-lingual healthcare evaluations reporting higher response quality in English than in non-English languages25, and with GPT-4o’s documentation, which reports significant improvements on non-English text compared with earlier GPT-4-series variants; this improvement may partially explain the smaller English–Chinese gap for GPT-4o. In contrast, DeepSeek-V2’s public documentation indicates its training data are primarily Chinese and English, with evaluations run in both languages. However, differences in data composition and alignment, alongside the “curse of multilinguality”, have been identified as fundamental limitations26. These limitations are amplified when processing care- and follow-up-oriented content, which demands nuanced cultural-pragmatic adaptation, thereby explaining the pronounced discrepancies in Groups D and E. Such findings highlight the importance of evaluating AI tools not only by overall performance, but also by their ability to maintain cross-linguistic reliability, which is essential when applying them in multicultural, multilingual health education environments.
LLMs exhibit stable performance across different times of the day and days of testing when assessed in controlled environments27. In the present study, no significant influence of either time of day or testing day on model accuracy was observed, supporting these earlier findings. Additionally, the consistency rates between ChatGPT-4o and DeepSeek remained comparable across different groups and languages, further reinforcing the notion of stable AI outputs across diverse contexts. These results align with recent literature on the reproducibility of AI performance in healthcare education28. Such temporal stability is particularly valuable for public health education, where learners often seek information asynchronously; consistent model performance across time enhances the reliability of LLMs as tools for supporting self-guided learning and informed patient engagement.
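As one illustration of how such consistency can be quantified, the sketch below computes a simple percent-agreement statistic: the average share of repeated sessions that return each question’s modal answer. This is a generic operationalization offered for illustration; the definition actually used in this study is the one specified in the Methods.

```python
from collections import Counter

def percent_agreement(answers_by_question):
    """Average share of repeated sessions that return each question's modal answer.
    A generic consistency measure for illustration; the study's own definition may differ."""
    per_question = []
    for answers in answers_by_question:                    # e.g. ["B", "B", "B", "C"]
        modal_count = Counter(answers).most_common(1)[0][1]
        per_question.append(modal_count / len(answers))
    return sum(per_question) / len(per_question)

# Hypothetical example: three questions, each answered in four separate sessions
sessions = [["B", "B", "B", "B"], ["A", "A", "C", "A"], ["D", "D", "D", "D"]]
print(round(100 * percent_agreement(sessions), 1))         # 91.7
```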
To ensure standardization and reduce interpretive bias, the models in this study were instructed to respond using only the letter corresponding to the selected answer, without explanation. This “choices-only” approach has been widely adopted in benchmark studies to enable reproducible, accuracy-focused comparisons across models, languages, and content domains16. While explanatory responses are important in real-world health education, evaluating them requires separate rubrics for coherence, clarity, and source reliability, which were beyond the scope of this initial assessment. Prior work has also shown that explanations may introduce stylistic variability or hallucinated logic, complicating direct performance comparisons in early-stage LLM evaluations29.
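For readers wishing to reproduce this prompting format programmatically, the sketch below illustrates a generic “choices-only” instruction through an OpenAI-compatible chat API. The sample item, model identifier, and decoding settings are placeholders; they do not reproduce the exact prompts, items, or interfaces used in this study.

```python
# Illustrative sketch of a "choices-only" prompt; the model name and sample item are
# placeholders, not the study's materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "Which appliance remains fixed to the teeth throughout orthodontic treatment?\n"
    "A. Clear aligner  B. Metal bracket braces  C. Sports mouthguard  D. Night guard"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute the model under evaluation
    messages=[
        {"role": "system",
         "content": "Answer with only the letter of the selected option (A, B, C, or D). "
                    "Do not provide any explanation."},
        {"role": "user", "content": question},
    ],
    temperature=0,  # reduces sampling variability across repeated runs
)
print(response.choices[0].message.content.strip())  # expected: a single letter, e.g. "B"
```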
ChatGPT-4o demonstrated significantly faster response times compared to DeepSeek in both language versions, which is consistent with previous studies highlighting the superior efficiency of GPT-4-based models in interactive educational applications30. In addition, the response time differences between ChatGPT-4o (< 2 s) and DeepSeek (~ 5 s) reveal distinct architectural approaches (Fig. 3). ChatGPT-4o’s speed advantage stems from its optimized multimodal architecture19, which is particularly beneficial for time-sensitive applications such as real-time patient communication or mobile-based orthodontic education. DeepSeek’s design prioritizes different objectives: (1) hybrid sparse-dense attention mechanisms for computational efficiency, (2) thorough response generation emphasizing quality over speed, and (3) optimized processing for complex reasoning tasks20. This trade-off makes DeepSeek potentially better suited to learning scenarios that demand detailed explanation and analytical depth, such as answering complex orthodontic case questions or explaining long-term treatment options.
Although efforts were made to control environmental and network conditions during timing assessments, minor confounders such as browser rendering latency, client-side processing, or server workload may have contributed to slight variations. Prior studies have highlighted that latency performance is influenced not only by model inference speed, but also by network transmission delays and user interface overhead31. Future research may employ automated scripts or direct API-level access to improve timing accuracy and eliminate observer bias—an approach aligned with current best practices in LLM benchmarking32.
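A minimal sketch of such API-level timing is shown below; it removes browser rendering and manual stopwatch error from the measurement path. The client, model identifier, and prompt are placeholders rather than the configuration used in this study, and network transmission delays still contribute to the measured latency.

```python
import time
from openai import OpenAI

client = OpenAI()  # placeholder client; any OpenAI-compatible endpoint could be timed this way

def mean_response_latency(model, prompt, repeats=5):
    """Mean wall-clock latency (seconds) over several identical requests."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

# Hypothetical usage with a placeholder model name and prompt
print(mean_response_latency("gpt-4o", "Answer with only the letter: ..."))
```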
Prior evaluations in dental education and orthodontics generally report that ChatGPT-class models can perform competitively on curricular or licensing-style content, yet topic-level variability is common and conclusions are often based on single-language, single-session, accuracy-only assessments8. In contrast, the present study contributes a bilingual, expert-validated item set spanning five dimensions of orthodontic health literacy, repeated measurements across days/time windows to characterize stability (reported as percent agreement), and response latency as a usability endpoint, thereby offering a more complete view of model behavior across accuracy, stability, and speed. This design aligns with current reporting guidance that emphasizes transparency and evaluability for clinical AI systems (CONSORT-AI/SPIRIT-AI; DECIDE-AI; TRIPOD+AI/TRIPOD-LLM)21. Recent comparative studies within dentistry/oral medicine that include ChatGPT-4o and DeepSeek tend to be single-shot evaluations on specific domains (e.g., oral pathology cases, dental anatomy, or mixed MCQs) and typically do not examine bilingual performance, time-of-day/day-to-day stability, or latency33. Our stratified analysis therefore complements this literature by revealing model- and language-specific heterogeneity at the group level (A–E), while aggregate performance remained comparable. In practical terms for orthodontic health literacy, these findings map onto patient-facing needs: foundational knowledge (Group A) requires stronger human oversight, whereas structured scenario-based counseling (Group C) appears to be a shared strength, suitable for adjunctive use in education and reinforcement34. In the context of health literacy and patient education, the findings are highly relevant. Comprehension, recall, and application underpin informed consent, adherence, and self-care, making clear bilingual communication and timely responsiveness essential35. By documenting bilingual accuracy, stability across sessions, and latency, the study delivers evidence that is stronger methodologically than single-session work and more usable in real settings8.
Several limitations warrant consideration. First, the Delphi process relied on a panel of five experts from a single institution. Although the panel was composed of senior specialists with substantial clinical and academic experience, and the consensus process was rigorously structured, the relatively small and homogeneous panel may limit the generalizability of the developed item set. Second, the 50-question set, although designed via the Delphi method, may not comprehensively reflect the full spectrum of orthodontic health knowledge relevant to the public11. Third, this study assessed AI-generated responses only in a simulated digital context; no human participants or real-world behavioral metrics were involved36. Thus, the educational impact on actual learning outcomes or decision-making processes was not measured. Fourth, key elements such as explanatory adequacy, source verifiability, and cultural contextualization were not assessed, despite their critical role in ensuring the safe and meaningful application of AI tools in public health education37. Additionally, the lack of a human comparison group limits the contextual understanding of model performance. Future studies should include participants such as laypersons and orthodontic patients to establish benchmarks for accuracy and reliability38. Such comparisons are crucial for evaluating the practical role of AI in supporting oral health literacy among the general public.
While this study focused on the comparative performance of two LLMs in orthodontic health education tasks, its scope was intentionally limited to a pairwise evaluation using a structured but concise question set. This exploratory design was not intended as a comprehensive benchmark, but rather as a foundational assessment. Notably, in contrast to prior dental/health-literacy evaluations that emphasized single-language, single-session accuracy, the present design integrates a bilingual, expert-validated item set, repeated measurements across days/sessions, and latency as a usability endpoint. This combination broadens the evidentiary base from correctness alone to include stability and user-relevant speed, improving interpretability for clinical and educational settings. Building upon this methodological advancement, future research should expand to include larger model cohorts, broader question banks, and multi-dimensional evaluation frameworks encompassing explanation quality, source transparency, and user learning outcomes.
AI holds strong potential as a complementary educational resource in the field of orthodontic health literacy. The findings of this study support the feasibility of applying general-purpose LLMs, such as ChatGPT-4o and DeepSeek, to simulate patient education scenarios across languages. As the accuracy and linguistic adaptability of these systems continue to evolve, they may become increasingly useful in health communication and public dental education.
