Assessing UML Diagrams by GPT: Implications for Education

Chong Wang, Beian Wang, Peng Liang, Jie Liang

TL;DR

This study evaluates the feasibility of using GPT-4o to automatically grade UML diagrams in software modeling education. It defines 11 evaluation criteria across use case, class, and sequence diagrams and applies them to the UML modeling reports of 40 students. Using a role-based prompt, GPT-4o grades the three diagram types and produces detailed deductions, which are then compared to human expert scores. The findings show that GPT can perform automatic assessment and provide personalized feedback, but it generally underperforms relative to human graders, with consistent gaps that vary by diagram type and criterion. The results highlight both the promise and the current limitations of AI-assisted grading in SE education, suggesting directions for prompt engineering, criterion refinement, and broader studies to extend AI support to UML modeling tasks and other domains.

Abstract

In software engineering (SE) research and practice, UML is well known as an essential modeling methodology for requirements analysis and software modeling in both academia and industry. In particular, fundamental knowledge of UML modeling and practice in creating high-quality UML diagrams are included in SE-relevant courses in the undergraduate programs of many universities. As a result, educators face the time-consuming and labor-intensive task of reviewing and grading a large number of UML diagrams created by students. Recent advances in generative AI techniques, such as GPT, have opened new ways to automate many SE tasks. However, current research and tools seldom explore the capabilities of GPT in evaluating the quality of UML diagrams. This paper aims to investigate the feasibility and performance of GPT in assessing the quality of UML use case diagrams, class diagrams, and sequence diagrams. First, 11 evaluation criteria with grading details were proposed for these UML diagrams. Next, a series of experiments was designed and conducted on 40 students' UML modeling reports to explore the performance of GPT in evaluating and grading these UML diagrams. The research findings reveal that GPT can complete this assessment task, but it cannot yet replace human experts. Meanwhile, there are five types of evaluation discrepancies between GPT and human experts. These discrepancies vary with the evaluation criteria used and the type of UML diagram, revealing GPT's strengths and weaknesses in this automatic evaluation task.

Paper Structure

This paper contains 19 sections, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: Score difference of the whole report between human experts and GPT over 40 students
  • Figure 2: Distribution of score difference in three types of UML diagram created by 40 students
  • Figure 3: Occurrence distribution of the five discrepancy types in grading 40 UML use case diagrams with four UCs
  • Figure 4:
  • Figure 5: Occurrence distribution of the five discrepancy types in grading 40 UML sequence diagrams with three SCs
  • ...and 1 more figure