Icing on the Cake: Automatic Code Summarization at Ericsson

Giriprasad Sridhara, Sujoy Roychowdhury, Sumit Soman, Ranjani H G, Ricardo Britto

TL;DR

The paper tackles automatic Java method summarization to aid software maintenance at Ericsson. It systematically compares the state-of-the-art ASAP baseline, which uses static analysis and exemplar prompts, against four lightweight prompting strategies that operate solely on the method body, including a concise WordRestrict prompt. Across an Ericsson project and open-source Java projects, and using multiple LLMs, the best of the simpler prompts (WordRestrict) matches or exceeds ASAP on eight similarity metrics and is less affected by method-name masking. This suggests a practical path to faster, more robust code summarization in commercial environments without heavy reliance on static analysis or exemplar corpora, with replication on Guava and Elasticsearch supporting generalizability.

Abstract

This paper presents our findings on the automatic summarization of Java methods within Ericsson, a global telecommunications company. We evaluate the performance of an approach called Automatic Semantic Augmentation of Prompts (ASAP), which uses a Large Language Model (LLM) to generate leading summary comments for Java methods. ASAP enhances the LLM's prompt context by integrating static program analysis and information retrieval techniques to identify similar exemplar methods along with their developer-written Javadocs, and serves as the baseline in our study. In contrast, we explore and compare the performance of four simpler approaches that do not require static program analysis, information retrieval, or the presence of exemplars as in the ASAP method. Our methods rely solely on the Java method body as input, making them lightweight and more suitable for rapid deployment in commercial software development environments. We conducted experiments on an Ericsson software project and replicated the study using two widely used open-source Java projects, Guava and Elasticsearch, to ensure the reliability of our results. Performance was measured across eight metrics that capture various aspects of similarity. Notably, one of our simpler approaches performed as well as or better than the ASAP method on both the Ericsson project and the open-source projects. Additionally, we performed an ablation study to examine the impact of method names on Javadoc summary generation across our four proposed approaches and the ASAP method. By masking the method names and observing the generated summaries, we found that our approaches were statistically significantly less influenced by the absence of method names compared to the baseline. This suggests that our methods are more robust to variations in method names and may derive summaries more comprehensively from the method body than the ASAP approach.
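
The method-name masking used in the ablation can be illustrated with a minimal sketch. The abstract does not describe how the masking was implemented, so everything here is an assumption for illustration: the helper `mask_method_name`, the placeholder token `METHOD_NAME`, and the toy Java method are all hypothetical.

```python
import re

# Hypothetical placeholder token; the paper's actual mask is not given here.
MASK = "METHOD_NAME"


def mask_method_name(java_method: str, name: str, mask: str = MASK) -> str:
    """Replace whole-word occurrences of `name` in a Java method's source.

    This is a purely textual substitution, so it also rewrites recursive
    calls and any other identifier spelled exactly like the method name.
    """
    return re.sub(rf"\b{re.escape(name)}\b", mask, java_method)


# Toy Java method used only for illustration.
source = "public int addOne(int x) {\n    return x + 1;\n}"
masked = mask_method_name(source, "addOne")
print(masked)
```

The masked source would then be placed in the prompt instead of the original, and the resulting summary scored with the same metrics, so that any drop in score can be attributed to the missing name.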

Paper Structure (23 sections, 4 figures, 6 tables)

This paper contains 23 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of the ASAP Approach and Our Study (prompts and contexts are shown in Table \ref{tab:approaches-javadoc-summary}).
  • Figure 2: Distribution of each metric for our best prompt, WordRestrict (blue), versus the baseline prompt (red). We show the distributions and p-values for a one-sided t-test for means and a one-sided KS test. The alternative hypothesis for the t-test is that our prompt has a higher mean than the baseline, whereas the alternative hypothesis for the KS test is that our prompt is stochastically greater than the baseline. Metrics with p-values $<0.05$ can be considered statistically significant.
  • Figure 3: Effect of method-name masking on the distribution of each metric for the baseline prompt (left) and our best prompt, WordRestrict (right), evaluated on the commercial dataset. Blue distributions indicate prompts with the method name available; red distributions indicate prompts with the method name masked. We show the distributions and p-values for a one-sided t-test for means and a one-sided KS test. The alternative hypothesis for the t-test is that the metric has a higher mean with the method name unmasked than with it masked, whereas the alternative hypothesis for the KS test is that the metric with the method name unmasked is stochastically greater than with it masked. Metrics with p-values $<0.05$ can be considered statistically significant.
  • Figure 4: Distribution of the best prompt (approach) across 100 queries (methods) on the commercial dataset. Each sub-figure contains eight charts, one per metric, with the metric name shown above each chart. In each chart, the vertical bars denote how many times a given prompt scored best on that metric. Our approach leads in 4 of 8 metrics with CodeLlama (top sub-figure) and in 7 of 8 metrics with Llama-70b (bottom sub-figure).
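
The per-metric tally described for Figure 4 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function `count_best_prompts`, the metric names `metric_a` and `metric_b`, and all scores are made up, and the sketch assumes that higher is better for every metric and that ties go to the prompt listed first.

```python
from collections import Counter


def count_best_prompts(scores):
    """Count, per metric, how often each prompt has the top score.

    `scores` maps metric name -> prompt name -> list of per-method scores,
    with all lists for a metric having the same length.
    Assumes higher is better; ties go to the prompt listed first.
    """
    wins = {}
    for metric, by_prompt in scores.items():
        prompts = list(by_prompt)
        n_methods = len(by_prompt[prompts[0]])
        counter = Counter()
        for i in range(n_methods):
            # max() returns the first prompt among equal top scores.
            best = max(prompts, key=lambda p: by_prompt[p][i])
            counter[best] += 1
        wins[metric] = counter
    return wins


# Toy scores for three methods; values are invented for illustration.
toy_scores = {
    "metric_a": {"ASAP": [0.2, 0.5, 0.1], "WordRestrict": [0.3, 0.4, 0.2]},
    "metric_b": {"ASAP": [0.6, 0.1, 0.4], "WordRestrict": [0.5, 0.3, 0.4]},
}
wins = count_best_prompts(toy_scores)
for metric, counter in wins.items():
    print(metric, dict(counter))
```

Each `Counter` corresponds to one chart in the figure: its keys are the prompts and its values are the bar heights. How the study resolved ties is not stated in the caption, so the first-listed rule here is only a placeholder.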