Automatic Code Summarization via ChatGPT: How Far Are We?

Weisong Sun; Chunrong Fang; Yudu You; Yun Miao; Yi Liu; Yuekang Li; Gelei Deng; Shenghan Huang; Yuchen Chen; Quanjun Zhang; Hanwei Qian; Yang Liu; Zhenyu Chen

TL;DR

This study systematically evaluates ChatGPT’s zero-shot capability for automatic code summarization on the CSN-Python dataset and compares it to three SOTA models (NCS, CodeBERT, CodeT5) using BLEU, METEOR, and ROUGE-L. Through a pre-study of heuristic questions, the authors derive an effective prompt to generate in-distribution comments and analyze the impact of prompt design on comment conciseness. The results show that ChatGPT generally underperforms the SOTA baselines in key metrics, though it can produce semantically rich and detailed summaries in some cases. The work identifies challenges and opportunities, including prompt engineering, output cropping/templating, benchmark construction, and developing more appropriate evaluation metrics for ChatGPT-based code summarization.

Abstract

To support software developers in understanding and maintaining programs, various automatic code summarization techniques have been proposed to generate a concise natural language comment for a given code snippet. Recently, the emergence of large language models (LLMs) has greatly boosted performance on natural language processing tasks. Among them, ChatGPT is the most popular and has attracted wide attention from the software engineering community. However, it remains unclear how ChatGPT performs in automatic code summarization. Therefore, in this paper, we focus on evaluating ChatGPT on a widely used Python dataset called CSN-Python and comparing it with several state-of-the-art (SOTA) code summarization models. Specifically, we first explore an appropriate prompt to guide ChatGPT to generate in-distribution comments. Then, we use this prompt to ask ChatGPT to generate comments for all code snippets in the CSN-Python test set. We adopt three widely used metrics (BLEU, METEOR, and ROUGE-L) to measure the quality of the comments generated by ChatGPT and by the SOTA models (NCS, CodeBERT, and CodeT5). The experimental results show that, in terms of BLEU and ROUGE-L, ChatGPT's code summarization performance is significantly worse than that of all three SOTA models. We also present some cases and discuss the advantages and disadvantages of ChatGPT in code summarization. Based on the findings, we outline several open challenges and opportunities in ChatGPT-based code summarization.
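To make the evaluation step concrete, the following is a minimal sketch of a sentence-level ROUGE-L score, one of the three metrics named above. It is an illustration under stated assumptions, not the authors' evaluation script: whitespace tokenization, lowercasing, the `beta=1.0` weighting, and the function names `lcs_length` and `rouge_l` are all choices made here for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score between a generated and a reference comment.

    Assumption: tokens are lowercased, whitespace-separated words; the paper's
    actual preprocessing may differ.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of generated tokens that match
    recall = lcs / len(ref)      # share of reference tokens that are covered
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical comments, invented for illustration only.
reference = "return the sum of two numbers"
concise = "return the sum of two numbers"
verbose = "this function returns the sum of two given numbers and prints it"

print(rouge_l(concise, reference))
print(rouge_l(verbose, reference))
```

In this toy example the verbose comment still covers most of the reference tokens, but its extra words lower precision and therefore the F-score. This is one plausible reason why longer, more detailed generated comments can score poorly on overlap-based metrics even when they are semantically reasonable.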

Paper Structure (21 sections, 6 equations, 15 figures, 6 tables)

Introduction
ChatGPT for Automatic Code Summarization
Q1: Can ChatGPT perform code summarization tasks?
Q2: What does the comment generated by ChatGPT look like?
Q3: How to use ChatGPT to generate concise comments?
Q4: What kind of prompt does ChatGPT suggest for generating short comments?
Q5: Which one performs better, the ChatGPT-suggested prompts in Q4 or the prompt proposed in Q3?
Experimental Design
Dataset
Evaluation Metrics
Baseline
Result Analysis and Case Study
Result Analysis
Case Study
successful cases
...and 6 more sections

Figures (15)

Figure 1: Overview of this paper
Figure 2: Prompt for Q1 and answer by ChatGPT to Q1
Figure 3: An example of code snippet and ground-truth summary/comment
Figure 4: Two prompts for Q2 and the corresponding answers by ChatGPT to Q2. The "<code>" in the figure represents a placeholder, and the code snippet we filled in during the experiment is $c_1$, shown in Figure \ref{fig:comment_number_by_sentence_number}(a).
Figure 5: Length of comments generated by ChatGPT and the ground-truth comments. '****' ($p<0.0001$) indicates that the difference between the two groups is extremely significant.
...and 10 more figures
