Should Code Models Learn Pedagogically? A Preliminary Evaluation of Curriculum Learning for Real-World Software Engineering Tasks

Kyi Shin Khant, Hong Yi Lin, Patanamon Thongtanunam

TL;DR

The paper investigates whether curriculum learning (CL) based on conventional code-difficulty metrics can improve performance on real-world software engineering tasks. It fine-tunes CodeT5-base on CodeXGLUE and evaluates CL on two tasks, code clone detection and code summarization, using four difficulty levels derived from code length ($L$) and cyclomatic complexity ($C$) and two training schedules (denoted $s$ and $r$). Contrary to prior results on synthetic code, CL with these metrics fails to beat unordered training and shows signs of catastrophic forgetting and shortcut learning, with performance saturating after the initial training subset. The findings suggest potential limits in model capacity or task difficulty and motivate broader CL evaluations across models, tasks, and languages to identify when CL yields real-world benefits in SE.

Abstract

Learning-based techniques, especially advanced pre-trained models for code, have demonstrated capabilities in code understanding and generation, solving diverse software engineering (SE) tasks. Despite the promising results, current training approaches may not fully optimize model performance, as they typically involve learning from randomly shuffled training data. Recent work shows that Curriculum Learning (CL) can improve performance on code-related tasks through incremental learning based on the difficulty of synthetic code. Yet the effectiveness of CL with conventional difficulty measures in SE tasks remains largely unexplored. In this study, we explore two conventional code metrics, code length and cyclomatic complexity, to determine the difficulty levels. We investigate how a pre-trained code model (CodeT5) learns under CL through the tasks of code clone detection and code summarization. Our empirical study on the CodeXGLUE benchmark shows results that contrast with prior studies: the model exhibits signs of catastrophic forgetting and shortcut learning. Surprisingly, model performance saturates after only the first quartile of training, potentially indicating a limit in the model's representation capacity and/or the task's inherent difficulty. Future work should further explore various CL strategies with different code models across a wider range of SE tasks for a more holistic understanding.
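To make the idea of difficulty levels concrete, the following is a minimal sketch of how code snippets might be scored by length or by an approximate cyclomatic complexity and then split into four ordered buckets. It is not the authors' implementation: the function names, the whitespace-token length measure, the regex-based complexity estimate, and the equal-size quartile split are all assumptions made for illustration, and the paper's schedule definitions are not reproduced here.

```python
import re
from typing import Callable, List

# Assumed set of decision points for a rough complexity estimate (illustrative only).
_DECISION = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")


def code_length(snippet: str) -> int:
    """Length as a whitespace-token count (an assumption; other tokenizations exist)."""
    return len(snippet.split())


def approx_cyclomatic(snippet: str) -> int:
    """Rough cyclomatic complexity: 1 plus the number of decision points matched."""
    return 1 + len(_DECISION.findall(snippet))


def quartile_buckets(
    snippets: List[str], difficulty: Callable[[str], int]
) -> List[List[str]]:
    """Sort snippets from easy to hard and split them into four near-equal buckets."""
    ranked = sorted(snippets, key=difficulty)
    n = len(ranked)
    bounds = [i * n // 4 for i in range(5)]
    return [ranked[bounds[i]:bounds[i + 1]] for i in range(4)]


if __name__ == "__main__":
    data = ["return 1;", "if (a && b) { x(); }", "int x = 1;", "while (i < n) { i++; }"]
    for level, bucket in enumerate(quartile_buckets(data, approx_cyclomatic), start=1):
        print(level, bucket)
```

A curriculum would then present the buckets to the model in some chosen order, for example easiest first; which orderings the paper's two schedules correspond to is left to the full text.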

Paper Structure

This paper contains 15 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Code Clone Detection ($L_{s}$, $L_{r}$, $C_{s}$, $C_{r}$)
  • Figure 2: Code Summarization ($L_{s}$, $L_{r}$, $C_{s}$, $C_{r}$)