Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, Graham Neubig

TL;DR

This work tackles open-domain NL-to-code generation by incorporating external knowledge into training. It pre-trains a model on large-scale NL-code pairs mined from StackOverflow and on API documentation, then fine-tunes it on a small manually labeled dataset. A retrieval-based re-sampling strategy aligns the distribution of API-documentation examples with real user queries. Implemented on a TranX-based framework with reranking, the method improves on the previous state of the art by up to 2.2% absolute BLEU on CoNaLa, and qualitative analysis shows better API selection and argument placement. The approach reduces reliance on costly annotations and demonstrates the value of combining noisy web-derived data with clean API references for code synthesis in general-purpose languages.

Abstract

Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at https://github.com/neulab/external-knowledge-codegen.
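The retrieval-based re-sampling idea can be illustrated with a small sketch: use real user intents as queries, retrieve the API-documentation entries that best match each one, and sample documentation examples in proportion to how often they are retrieved. Everything in the sketch is an assumption made for illustration and is not taken from the released code: the Jaccard token-overlap retriever stands in for a real search engine, the `temperature` knob is a hypothetical way to flatten or sharpen the distribution, and all function names and example strings are invented.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (illustrative only)."""
    return set(text.lower().split())


def retrieve_top_k(query, docs, k):
    """Return indices of the k docs with the highest Jaccard overlap.

    A stand-in for a real retriever; ties keep the original doc order.
    """
    q = tokenize(query)

    def score(doc):
        d = tokenize(doc)
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    ranked = sorted(range(len(docs)), key=lambda i: score(docs[i]), reverse=True)
    return ranked[:k]


def resampling_distribution(queries, docs, k=1, temperature=1.0):
    """Estimate how often each doc is retrieved by real queries.

    Retrieval counts are raised to 1/temperature and normalized, so a
    temperature above 1 flattens the distribution (hypothetical knob).
    Docs that are never retrieved get no probability mass.
    """
    counts = Counter()
    for query in queries:
        counts.update(retrieve_top_k(query, docs, k))
    weights = {i: c ** (1.0 / temperature) for i, c in counts.items()}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def resample(docs, dist, n, seed=0):
    """Draw n docs (with replacement) according to the distribution."""
    rng = random.Random(seed)
    indices = list(dist)
    probs = [dist[i] for i in indices]
    return [docs[i] for i in rng.choices(indices, weights=probs, k=n)]


# Invented toy data: three documentation descriptions, three user intents.
api_docs = [
    "sort a list in place",
    "open a file for reading",
    "join strings with a separator",
]
user_intents = [
    "sort list descending",
    "sort a list of tuples",
    "open file and read lines",
]

dist = resampling_distribution(user_intents, api_docs, k=1)
pretraining_sample = resample(api_docs, dist, n=5)
```

In this toy setup the sorting entry is retrieved by two of the three intents and the file entry by one, so the re-sampled pre-training data leans toward APIs that users actually ask about, while the never-retrieved entry is dropped.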

Paper Structure

This paper contains 15 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Our approach: incorporating external knowledge by data re-sampling, pre-training and fine-tuning.
  • Figure 2: Examples from Python API documentation and pre-processed code snippets, including class constructors, methods, and top-level functions. We use red, blue, and green to denote required, optional positional, and optional keyword arguments respectively.
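The three argument categories named in the Figure 2 caption (required, optional positional, optional keyword) can be made concrete with a short sketch. This is only an illustration under stated assumptions: it inspects a live Python callable with the standard `inspect` module rather than parsing documentation text, the mapping of keyword-only parameters with defaults to "optional keyword" is one interpretation of the caption, and `example_api` is a hypothetical function invented here.

```python
import inspect


def classify_parameters(func):
    """Group a callable's parameters into the three caption categories.

    Assumed mapping (an interpretation, not the paper's procedure):
      - no default value            -> required
      - default, keyword-only       -> optional keyword
      - default, otherwise          -> optional positional
    *args and **kwargs are skipped.
    """
    groups = {"required": [], "optional_positional": [], "optional_keyword": []}
    for name, param in inspect.signature(func).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if param.default is inspect.Parameter.empty:
            groups["required"].append(name)
        elif param.kind is inspect.Parameter.KEYWORD_ONLY:
            groups["optional_keyword"].append(name)
        else:
            groups["optional_positional"].append(name)
    return groups


# Hypothetical API used only to exercise the classifier.
def example_api(path, mode="r", *, encoding=None):
    return (path, mode, encoding)


groups = classify_parameters(example_api)
```

Here `path` falls under required, `mode` under optional positional, and `encoding` under optional keyword.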