Building A Coding Assistant via the Retrieval-Augmented Language Model

Xinze Li, Hanbin Wang, Zhenghao Liu, Shi Yu, Shuo Wang, Yukun Yan, Yukai Fu, Yu Gu, Ge Yu

TL;DR

CONAN is a retrieval-augmented language model that builds a code assistant by mimicking the knowledge-seeking behaviors of humans during coding. It achieves convincing performance on different code generation tasks and significantly outperforms previous retrieval-augmented code generation models.

Abstract

Pretrained language models have shown strong effectiveness in code-related tasks, such as code retrieval, code generation, code summarization, and code completion. In this paper, we propose COde assistaNt viA retrieval-augmeNted language model (CONAN), which aims to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding. Specifically, it consists of a code structure-aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G). CONAN-R pretrains CodeT5 using Code-Documentation Alignment and Masked Entity Prediction tasks to make language models code structure-aware and to learn effective representations for code snippets and documentation. CONAN-G then uses a dual-view code representation mechanism to implement a retrieval-augmented code generation model. CONAN-G treats the code documentation descriptions as prompts, which help language models better understand the code semantics. Our experiments show that CONAN achieves convincing performance on different code generation tasks and significantly outperforms previous retrieval-augmented code generation models. Our further analyses show that CONAN learns tailored representations for both code snippets and documentation by aligning code-documentation data pairs, and captures structural semantics by masking and predicting entities in the code data. Additionally, the retrieved code snippets and documentation provide necessary information from both programming language and natural language to assist the code generation process. CONAN can also be used as an assistant for Large Language Models (LLMs), providing LLMs with external knowledge in the form of shorter code documents to improve their effectiveness on various code tasks. This shows the ability of CONAN to extract necessary information and help filter out the noise from retrieved code documents.
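To make the Masked Entity Prediction idea in the abstract more concrete, the following is a minimal sketch of how a masked input and its prediction target might be built from a code snippet. It is an illustration under stated assumptions, not the authors' implementation: the function name `mask_entities`, the T5-style sentinel format `<extra_id_N>`, the choice to share one sentinel across all occurrences of the same entity, and the assumption that the entity list is supplied by the caller are all hypothetical.

```python
import re


def mask_entities(code, entities):
    """Build a (masked_input, target) pair for entity-masking pretraining.

    Assumptions (not taken from the paper): entities are plain identifiers
    given by the caller, sentinels follow the T5 "<extra_id_N>" convention,
    and every occurrence of the same entity shares one sentinel.
    """
    if not entities:
        return code, ""

    # Assign one sentinel per distinct entity, in the order given.
    sentinel_of = {}
    for name in entities:
        if name not in sentinel_of:
            sentinel_of[name] = f"<extra_id_{len(sentinel_of)}>"

    # Single-pass replacement; \b avoids masking parts of longer identifiers.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in sentinel_of) + r")\b"
    )
    masked = pattern.sub(lambda m: sentinel_of[m.group(1)], code)

    # The target lists each sentinel followed by the entity it hides.
    target = " ".join(f"{s} {n}" for n, s in sentinel_of.items())
    return masked, target


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b"
    masked_input, target = mask_entities(snippet, ["add", "a", "b"])
    print(masked_input)
    print(target)
```

A sequence-to-sequence model such as CodeT5 would then be trained to generate the target string from the masked input, which pushes it to recover identifiers from the surrounding code structure.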

Paper Structure

This paper contains 19 sections, 14 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: The Motivation of Building A Code Assistant via the Retrieval-Augmented Code Generation Model.
  • Figure 2: The Architecture of COde AssistaNt viA Retrieval-AugmeNted Language Model (CONAN). CONAN consists of a code structure-aware retriever (CONAN-R) and a dual-view code representation mechanism (CONAN-G). We employ Code-Documentation Alignment (CDA) and Masked Entity Prediction (MEP) methods for CONAN-R pretraining. CONAN-G is implemented with the Fusion-in-Decoder (FID) architecture.
  • Figure 3: Examples of the Pretraining Data for CONAN-R. All entities of different functions are annotated with different colors in Figure 3 (b).
  • Figure 4: The impact of the number of retrieved code snippets/documentation on CONAN’s performance.
  • Figure 5: The Similarity between Top-1 Ranked Code Documents and the Target Answers. Based on whether the model's output matches the target answer, the instances in the testing dataset are divided into two groups (pred$==$gold and pred$!=$gold). Then the CBLEU score between the top-1 ranked code document and the target answer is calculated for each group. A higher CBLEU/BLEU score indicates that the top-1 ranked code document is more similar to the target answer, which suggests that the retrieved code document is of high enough quality to assist the code generation or summarization tasks.
  • ...and 1 more figures
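
The grouping analysis described in the Figure 5 caption can be sketched as follows. This is a rough illustration only: the field names `pred`, `gold`, and `top1` are hypothetical, and the whitespace-token F1 used here is a simple stand-in for the CBLEU/BLEU metrics named in the caption, not an implementation of them.

```python
from collections import Counter
from statistics import mean


def token_f1(candidate, reference):
    """Whitespace-token overlap F1. A stand-in similarity, NOT CBLEU/BLEU."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def grouped_top1_similarity(examples, similarity=token_f1):
    """Average similarity between the top-1 retrieved document and the target,
    computed separately for correct and incorrect model outputs.

    Each example is a dict with hypothetical keys:
      "pred": model output, "gold": target answer, "top1": top-1 retrieved doc.
    """
    groups = {"pred==gold": [], "pred!=gold": []}
    for ex in examples:
        matched = ex["pred"].strip() == ex["gold"].strip()
        key = "pred==gold" if matched else "pred!=gold"
        groups[key].append(similarity(ex["top1"], ex["gold"]))
    # An empty group has no average, so report None for it.
    return {k: (mean(v) if v else None) for k, v in groups.items()}


if __name__ == "__main__":
    data = [
        {"pred": "return a + b", "gold": "return a + b", "top1": "return a + b"},
        {"pred": "x", "gold": "return a - b", "top1": "print ( a )"},
    ]
    print(grouped_top1_similarity(data))
```

Passing a different `similarity` callable (for example, a real BLEU implementation) leaves the grouping logic unchanged.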