CoditT5: Pretraining for Source Code and Natural Language Editing

Jiyang Zhang; Sheena Panthaplackel; Pengyu Nie; Junyi Jessy Li; Milos Gligoric

TL;DR

CoditT5 introduces an edit-based pretraining objective for software editing tasks by first generating an edit plan and then the edited target sequence, enabling explicit edit reasoning. Pretrained on CodeSearchNet data and fine-tuned on comment updating, bug fixing, and automated code review, CoditT5 outperforms generation-based baselines and prior editors. The study further demonstrates that simple reranking to combine CoditT5 with CodeT5 yields state-of-the-art results across all three tasks, highlighting complementarity between edit-based and generation-based approaches. The work advances practical editing capabilities in software engineering and provides data/model resources to endow editors with verifiable edit reasoning.

Abstract

Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming standard generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a standard generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks.
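The reranking idea mentioned above can be illustrated with a minimal sketch. It assumes that one model proposes candidates with log-probabilities and that the other model can score any candidate. The function names, the additive combination of log-probabilities, and the toy scores are all hypothetical illustrations, not the paper's exact procedure.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    candidates: Sequence[Tuple[str, float]],
    other_model_logprob: Callable[[str], float],
) -> List[str]:
    """Reorder candidates by the sum of two models' log-probabilities.

    candidates: (text, log-probability) pairs from the proposing model.
    other_model_logprob: hypothetical scorer backed by the second model.
    """
    scored = [(logp + other_model_logprob(text), text) for text, logp in candidates]
    # Highest combined score first; sorted() is stable, so ties keep input order.
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


# Toy usage: made-up scores standing in for real model likelihoods.
proposed = [("return a;", -0.5), ("return a + b;", -0.75)]
second_model_scores = {"return a;": -2.0, "return a + b;": -0.25}
ranking = rerank(proposed, lambda text: second_model_scores[text])
```

In this toy case the second candidate moves to the top because the second model strongly prefers it, which is the kind of complementarity the abstract describes.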

Paper Structure (44 sections, 1 equation, 5 figures, 9 tables)

Introduction
Background
Conditional Sequence Generation
Encoder-Decoder Framework
Transformers
Large Pretrained Language Models
Denoising Autoencoder Pretraining
Fine-tuning for Downstream Tasks
Large Pretrained Language Models for Software Engineering
CoditT5
Pretraining Objective
Edit Plan
Target Sequence
Noising Functions
Pretraining Data
...and 29 more sections

Figures (5)

Figure 1: An example from the automated code review task in which PLBART merely copies the input, which does not match the reviewer's comment.
Figure 2: The corrupted text is encoded with a bidirectional encoder, and the decoder is pretrained to generate a sequence of edit actions that recover the original text, followed by a separation token (<s>), and finally the target sequence.
Figure 3: Comparing the output of CodeT5 and CoditT5 for an automated code review example. CodeT5 generates incorrect output that drastically deviates from the input code, while CoditT5 generates the correct output, performing only relevant edits.
Figure 4: Examples for automated code review for which CoditT5 generated ambiguous or erroneous edit plans but still managed to generate the correct target sequences.
Figure 5: Examples from comment updating and bug fixing which demonstrate the impact of reranking.
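
The decoder output described in the Figure 2 caption (edit actions, then the separation token <s>, then the target sequence) can be sketched as follows. This is only an illustration under stated assumptions: the edit-action tag names below are hypothetical placeholders, and Python's difflib stands in for whatever alignment procedure the authors actually use.

```python
from difflib import SequenceMatcher
from typing import List

SEP = "<s>"  # separation token named in the Figure 2 caption


def build_decoder_target(corrupted: List[str], original: List[str]) -> List[str]:
    """Return the token list 'edit plan, <s>, target sequence'.

    The edit plan lists the token-level edits that turn `corrupted` back into
    `original`. Tag names such as <INSERT> are assumptions for illustration.
    """
    plan: List[str] = []
    matcher = SequenceMatcher(a=corrupted, b=original, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            plan += ["<INSERT>", *original[j1:j2], "<INSERT_END>"]
        elif tag == "delete":
            plan += ["<DELETE>", *corrupted[i1:i2], "<DELETE_END>"]
        elif tag == "replace":
            plan += ["<REPLACE_OLD>", *corrupted[i1:i2],
                     "<REPLACE_NEW>", *original[j1:j2], "<REPLACE_END>"]
        # "equal" spans need no edit action.
    return plan + [SEP] + list(original)


# Toy usage: the noising step dropped the "+" token.
target = build_decoder_target("return a b".split(), "return a + b".split())
print(" ".join(target))
```

Placing the edit plan before the target means the target tokens are generated conditioned on the plan, which is how the TL;DR's "first generating an edit plan and then the edited target sequence" reads.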