
An Empirical Study on the Effectiveness of Large Language Models for SATD Identification and Classification

Mohammad Sadegh Sheikhaei, Yuan Tian, Shaowei Wang, Bowen Xu

TL;DR

The paper empirically evaluates large language models for SATD identification and classification using two datasets, Maldonado-62k and OBrien, across multiple model sizes and adaptation strategies. It demonstrates that fine-tuned LLMs, particularly Flan-T5-XL, achieve state-of-the-art performance in SATD identification, while classification benefits from larger models though CNN baselines remain competitive. In-context learning with Flan-T5-XXL shows competitive results for identification but generally underperforms fine-tuned models, whereas few-shot prompts with category descriptions and contextual features can surpass smaller fine-tuned models in classification. A modified Flan-T5 architecture with a classification head improves SATD classification, and contextual information such as surrounding code enhances performance for larger models. The study emphasizes data quality and contextual signals as key levers for improving SATD modeling with LLMs and provides insights for future work in SATD and SE tasks that involve classification with scarce labeled data.

Abstract

Self-Admitted Technical Debt (SATD), a concept highlighting sub-optimal choices in software development documented in code comments or other project resources, poses challenges for the maintainability and evolution of software systems. Large language models (LLMs) have demonstrated significant effectiveness across a broad range of software tasks, especially in software text generation tasks. Nonetheless, their effectiveness in tasks related to SATD is still under-researched. In this paper, we investigate the efficacy of LLMs in both identification and classification of SATD. For both tasks, we investigate the performance gain from using more recent LLMs, specifically the Flan-T5 family, across different common usage settings. Our results demonstrate that for SATD identification, all fine-tuned LLMs outperform the best existing non-LLM baseline, i.e., the CNN model, with a 4.4% to 7.2% improvement in F1 score. In the SATD classification task, while our largest fine-tuned model, Flan-T5-XL, still led in performance, the CNN model exhibited competitive results, even surpassing four of six LLMs. We also found that the largest Flan-T5 model, i.e., Flan-T5-XXL, when used with a zero-shot in-context learning (ICL) approach for SATD identification, provides results competitive with traditional approaches but performs 6.4% to 9.2% worse than fine-tuned LLMs. For SATD classification, the few-shot ICL approach, which incorporates examples and category descriptions in prompts, outperforms the zero-shot approach and even surpasses the smaller fine-tuned Flan-T5 models. Moreover, our experiments demonstrate that incorporating contextual information, such as surrounding code, into the SATD classification task enables larger fine-tuned LLMs to improve their performance.
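To make the few-shot ICL setting concrete, the following is a minimal sketch of how a prompt combining category descriptions and labeled examples might be assembled. The category names, their descriptions, the prompt wording, and the function name `build_few_shot_prompt` are all illustrative assumptions; the abstract does not give the prompt template used in the study.

```python
from typing import Dict, List, Tuple

# Hypothetical category descriptions, for illustration only.
# The study's actual categories and wording are not given in the abstract.
CATEGORY_DESCRIPTIONS: Dict[str, str] = {
    "design": "the comment admits a sub-optimal design or implementation choice",
    "requirement": "the comment admits missing or incomplete functionality",
    "defect": "the comment admits a known bug that is not yet fixed",
}


def build_few_shot_prompt(
    comment: str,
    examples: List[Tuple[str, str]],
    descriptions: Dict[str, str],
) -> str:
    """Assemble a classification prompt: descriptions, then examples, then the query."""
    lines = ["Classify the technical debt comment into one of these categories:"]
    for name, description in descriptions.items():
        lines.append(f"- {name}: {description}")
    lines.append("")
    # Each labeled example is shown in the same format as the final query.
    for text, label in examples:
        lines.append(f"Comment: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    # The query ends with an open "Category:" slot for the model to complete.
    lines.append(f"Comment: {comment}")
    lines.append("Category:")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = build_few_shot_prompt(
        "TODO: handle null input",
        [("FIXME: this is a hack", "design")],
        CATEGORY_DESCRIPTIONS,
    )
    print(demo)
```

Passing an empty `examples` list yields the corresponding zero-shot prompt under the same assumed template, which makes the two settings easy to compare.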

Paper Structure

This paper contains 54 sections, 3 figures, and 14 tables.

Figures (3)

  • Figure 1: Updating Flan-T5 architecture by replacing the last layer with a classification layer
  • Figure 2: Average F1 score across 10 projects in the Maldonado-62k dataset over epochs
  • Figure 3: Average accuracy across 10 folds over epochs (Dataset: OBrien, Approach: 10-fold cross validation, number of runs: 3)
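Figure 2 reports an F1 score averaged across projects. As a rough sketch of how such a number could be computed, the code below calculates a binary F1 score per project and takes the unweighted mean. It assumes SATD is the positive class (label 1) and that projects are weighted equally; the function names are hypothetical, and the paper's exact evaluation protocol is not stated in this excerpt.

```python
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 with label 1 as the positive (SATD) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(per_project: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Unweighted mean of per-project F1 scores (an assumed averaging scheme)."""
    scores = [f1_score(y_true, y_pred) for y_true, y_pred in per_project]
    return sum(scores) / len(scores)
```

An unweighted mean treats a small project the same as a large one; pooling all predictions before computing F1 would instead weight projects by size, so the two schemes can give different numbers.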