How Effective are Large Language Models in Generating Software Specifications?
Danning Xie, Byungwoo Yoo, Nan Jiang, Mijung Kim, Lin Tan, Xiangyu Zhang, Judy S. Lee
TL;DR
This paper presents the first empirical evaluation of large language models for generating software specifications from natural-language inputs (comments and documentation), comparing 13 LLMs against traditional rule-based approaches on three public datasets. Using few-shot learning and two prompt-construction strategies (random selection vs. semantic retrieval of examples), the study shows that LLMs, especially CodeLlama-13B and StarCoder2-15B, can achieve comparable or superior accuracy and F1 scores with far fewer annotated examples than traditional methods. A detailed failure diagnosis reveals complementary strengths and weaknesses: LLMs excel with high-quality prompts and domain knowledge, while traditional methods struggle with missing rules but produce fewer ill-formed outputs. The work points to promising directions, including hybrid approaches and improved prompting, to further enhance specification generation in software engineering, with open-source models offering cost-effective and flexible options.
Abstract
Software specifications are essential for many Software Engineering (SE) tasks such as bug detection and test generation. Many approaches have been proposed to translate specifications written in natural language (e.g., comments) into a formal, machine-readable form (e.g., first-order logic). However, existing approaches suffer from limited generalizability and require manual effort. The recent emergence of Large Language Models (LLMs), which have been successfully applied to numerous SE tasks, offers a promising avenue for automating this process. In this paper, we conduct the first empirical study to evaluate the capabilities of LLMs for generating software specifications from software comments or documentation. We evaluate LLMs' performance with Few-Shot Learning (FSL) and compare the performance of 13 state-of-the-art LLMs with traditional approaches on three public datasets. In addition, we conduct a comparative diagnosis of the failure cases from both LLMs and traditional methods, identifying their unique strengths and weaknesses. Our study offers valuable insights for future research to improve specification generation.
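
To make the few-shot setup concrete, the following is a minimal sketch of how a prompt could be assembled from annotated examples using either random selection or semantic retrieval. It is not the authors' implementation. The example pool, the specification syntax, the prompt layout, and all function names are hypothetical, and a simple bag-of-words cosine similarity stands in for whatever learned embedding a real semantic retriever would use.

```python
import math
import random
from collections import Counter

# Hypothetical annotated pool: (natural-language comment, formal specification).
# Both the comments and the specification syntax are invented for illustration.
POOL = [
    ("@param index must not be negative", "index >= 0"),
    ("@throws NullPointerException if the key is null", "key == null -> NullPointerException"),
    ("@param name must not be null", "name != null"),
    ("@return true if the list is empty", "result == (size == 0)"),
]


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a learned sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_random(pool, k, seed=0):
    """Random strategy: sample k examples uniformly, ignoring the query."""
    return random.Random(seed).sample(pool, k)


def select_semantic(pool, query, k):
    """Semantic strategy: keep the k examples whose comments are most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bag_of_words(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(examples, query):
    """Format the selected examples and the target comment as a few-shot prompt."""
    shots = "\n\n".join(f"Comment: {c}\nSpecification: {s}" for c, s in examples)
    return f"{shots}\n\nComment: {query}\nSpecification:"


query = "@param value must not be null"
prompt = build_prompt(select_semantic(POOL, query, k=2), query)
```

In this sketch, semantic retrieval places the two `@param ... must not be ...` examples ahead of the unrelated ones, so the prompt shows the model specifications that resemble the target. Random selection may instead pick examples with no bearing on the query.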
