Towards Summarizing Code Snippets Using Pre-Trained Transformers

Antonio Mastropaolo, Matteo Ciniselli, Luca Pascarella, Rosalia Tufano, Emad Aghajani, Gabriele Bavota

TL;DR

The paper tackles the challenge of generating natural language summaries for code snippets rather than entire functions. It introduces SALOON, a multi-task T5-based model that classifies whether an inner comment is a code summary and links it to the specific code statements it documents; the model is pretrained on CodeSearchNet and trained on a manually labeled Java dataset. Building on SALOON, the authors create a large-scale snippet-description dataset (~554k pairs) and train STUNT, a code-snippet summarizer that outperforms IR and RL baselines on BLEU and METEOR, though many generated summaries remain imperfect. The work demonstrates the feasibility of scalable, snippet-level documentation and provides datasets, models, and replication materials to spur further research, with implications for code comprehension and documentation workflows.

Abstract

When comprehending code, a helping hand may come from the natural language comments documenting it, which, unfortunately, are not always there. To support developers in such a scenario, several techniques have been presented to automatically generate natural language summaries for a given piece of code. Most recent approaches exploit deep learning (DL) to automatically document classes or functions, while little effort has been devoted to more fine-grained documentation (e.g., documenting code snippets or even a single statement). Such a design choice is dictated by the availability of training data: for example, in the case of Java, it is easy to create datasets composed of pairs <Method, Javadoc> that can be fed to DL models to teach them how to summarize a method. Such comment-to-code linking is instead non-trivial when it comes to inner comments documenting a few statements. In this work, we take all the steps needed to train a DL model to document code snippets. First, we manually built a dataset featuring 6.6k comments that have been (i) classified based on their type (e.g., code summary, TODO), and (ii) linked to the code statements they document. Second, we used this dataset to train a multi-task DL model that takes a comment as input and is able to (i) classify whether it represents a "code summary" or not and (ii) link it to the code statements it documents. Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code with recall and precision higher than 80%. Third, we ran this model on 10k projects, identifying code summaries and linking them to the documented code. This unlocked the possibility of building a large-scale dataset of documented code snippets, which was then used to train a new DL model able to document code snippets. A comparison with state-of-the-art baselines shows the superiority of the proposed approach.
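
To make the linking figures concrete, the following is a minimal sketch of how precision and recall could be computed for a single comment, assuming both the predicted and the manually labeled links are represented as sets of line numbers. The function name `linking_precision_recall` and the line-level, set-based definition are illustrative assumptions; the abstract does not specify how the authors compute these metrics.

```python
def linking_precision_recall(predicted_lines, gold_lines):
    """Line-level precision and recall for one comment-to-code link.

    Assumption: a link is a set of line numbers; this is an illustrative
    definition, not necessarily the one used in the paper.
    """
    predicted, gold = set(predicted_lines), set(gold_lines)
    overlap = len(predicted & gold)  # lines that were correctly linked
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: the model links a comment to lines 12-15,
# while the annotators linked it to lines 13-17.
precision, recall = linking_precision_recall([12, 13, 14, 15], [13, 14, 15, 16, 17])
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

In this hypothetical case three of the four predicted lines are correct (precision 0.75) and three of the five labeled lines are recovered (recall 0.60).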

Paper Structure (24 sections, 1 figure, 7 tables)