On the robustness of ChatGPT in teaching Korean Mathematics

Phuong-Nam Nguyen, Quang Nguyen-The, An Vu-Minh, Diep-Anh Nguyen, Xuan-Lam Pham

TL;DR

This study evaluates the robustness of ChatGPT in solving Korean-language mathematics problems drawn from the CSAT, using a dataset of $586$ questions and assessing both solving accuracy ($66.72\%$) and the model's ability to rate and categorize questions. It analyzes eleven rating criteria via Likert scales, conducts a topic analysis, and examines correlations between actual question marks and predicted difficulty, cognitive demand, time requirement, and engagement. Key findings show strong alignment between ChatGPT's ratings and educational theory, but weaknesses in sequential reasoning and diagram-based tasks; bilingual prompt strategies and multi-step prompting can modestly improve performance. The work provides actionable insights for applying LLMs in multilingual math education, including adaptive-learning opportunities and design considerations for diagram-rich content, while outlining directions for future research and dataset development. The results underscore both the potential and the limitations of deploying AI-assisted tools in non-English, STEM-rich educational settings and highlight avenues to enhance accuracy and inclusivity through language-aware fine-tuning and domain-specific data.

Abstract

ChatGPT, an Artificial Intelligence model, has the potential to revolutionize education. However, its effectiveness in solving non-English questions remains uncertain. This study evaluates ChatGPT's robustness using 586 Korean mathematics questions. ChatGPT achieves 66.72% accuracy, correctly answering 391 out of 586 questions. We also assess its ability to rate mathematics questions based on eleven criteria and perform a topic analysis. Our findings show that ChatGPT's ratings align with educational theory and test-taker perspectives. While ChatGPT performs well in question classification, it struggles with non-English contexts, highlighting areas for improvement. Future research should address linguistic biases and enhance accuracy across diverse languages. Domain-specific optimizations and multilingual training could improve ChatGPT's role in personalized education.
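The reported accuracy follows directly from the two counts given in the abstract. As a minimal sketch of that arithmetic (the helper name `accuracy_percent` is hypothetical and not taken from the paper), the figure can be checked as follows:

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Counts reported in the abstract: 391 correct answers out of 586 questions.
reported = accuracy_percent(391, 586)
print(f"{reported:.2f}%")  # prints "66.72%"
```

The remaining 195 questions (586 minus 391) are the ones ChatGPT answered incorrectly.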

Paper Structure

This paper contains 13 sections, 12 figures, 3 tables.

Figures (12)

  • Figure 1: (Left) The accuracy of ChatGPT's solutions for CSAT mathematics questions. (Right) The distribution of scores for each criterion proposed in Table \ref{tab:criterion}
  • Figure 2: Correlations of actual question marks with: (1) the accuracy of AI solutions, (2) predicted difficulty, (3) predicted cognitive demand, (4) predicted time requirement, and (5) predicted originality and engagement
  • Figure 3: The topic analysis by ChatGPT. We give a detailed discussion in Section \ref{sec:topic_analysis}
  • Figure 4: Topics of incorrect responses
  • Figure 5: Chi-squared test for the predicted variables in the correlation analysis of Figure \ref{fig:corr_analysis}
  • ...and 7 more figures

Theorems & Definitions (8)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Remark 7
  • Remark 8