Which AI Technique Is Better to Classify Requirements? An Experiment with SVM, LSTM, and ChatGPT

Abdelkarim El-Hajjami; Nicolas Fafin; Camille Salinesi

TL;DR

The paper addresses the classification of software requirements into functional and quality aspects, framed as four binary classes (IsFunctional, IsQuality, OnlyFunctional, OnlyQuality), and compares traditional techniques (SVM, LSTM) with ChatGPT. It conducts an extensive empirical evaluation across five public datasets, analyzes zero-shot and few-shot prompting for two ChatGPT models, and uses a class-aware $F_\beta$ metric because the data is imbalanced. The key finding is that there is no single best technique: GPT-3.5 Few-Shot often performs best for IsFunctional and OnlyQuality, LSTM excels at IsQuality, and GPT-4 Zero-Shot can outperform the others for OnlyFunctional at a higher cost. The benefit of few-shot prompting depends largely on the class and on how strong the zero-shot baseline already is. The work provides practical guidance for RE practitioners and informs future benchmarking of LLMs in requirements engineering, highlighting the importance of dataset characteristics and class balance.
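As a reminder of what an $F_\beta$ score measures, the standard definition can be sketched in a few lines of Python. This is only the textbook formula, not the authors' evaluation code; the function name `f_beta` and the example counts are hypothetical, and the $\beta$ values the paper assigns to each class are not stated in this summary, so the ones below are placeholders.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Standard F-beta score computed from confusion-matrix counts.

    beta > 1 weights recall more heavily than precision,
    beta < 1 weights precision more heavily than recall.
    Returns 0.0 when the score is undefined (no positives at all).
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return 0.0 if denom == 0 else (1 + b2) * tp / denom


# Hypothetical counts for one binary class (not taken from the paper):
# precision = 8 / 10 = 0.8, recall = 8 / 16 = 0.5
recall_weighted = f_beta(tp=8, fp=2, fn=8, beta=2.0)      # placeholder beta
precision_weighted = f_beta(tp=8, fp=2, fn=8, beta=0.5)   # placeholder beta
```

With imbalanced classes, choosing $\beta$ per class lets the score emphasize whichever error type (missed positives or false alarms) matters more for that class.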

Abstract

Recently, Large Language Models like ChatGPT have demonstrated remarkable proficiency in various Natural Language Processing tasks. Their application in Requirements Engineering, especially in requirements classification, has gained increasing interest. This paper reports an extensive empirical evaluation of two ChatGPT models, gpt-3.5-turbo and gpt-4, in both zero-shot and few-shot settings for requirements classification. The question is how these models compare to traditional classification methods, specifically Support Vector Machine and Long Short-Term Memory. Based on five different datasets, our results show that there is no single best technique for all types of requirement classes. Interestingly, the few-shot setting proved beneficial primarily in scenarios where zero-shot results were very low.

Paper Structure (20 sections, 7 tables)

Introduction
Methodology
The classification problem
The Datasets
The Evaluated Models
SVM
LSTM
ChatGPT
Prompt Engineering
Zero-Shot Prompting
Few-Shot Prompting
Querying ChatGPT
Experimental Results Analysis
The Evaluation Metric
RQ1: What is the best technique for requirements classification between SVM, LSTM and ChatGPT?
...and 5 more sections
