Evaluating In-Context Learning of Libraries for Code Generation

Arkil Patel; Siva Reddy; Dzmitry Bahdanau; Pradeep Dasigi

TL;DR

The paper investigates how Large Language Models can learn in-context to use novel libraries, and even new programming languages, for code generation. It systematically compares demonstrations, descriptions, and implementations as supervision across multiple open-source and proprietary models, showing that smaller open models can learn from non-demonstration signals and that GPT-4 remains robust to aliasing and constraint scenarios. Constrained generation reveals that enforcing library usage can hinder performance, while learning a language like Isabelle from descriptions or demonstrations remains feasible, although the outcome depends on the model and on aliasing. Overall, the work suggests in-context learning as a flexible, data-efficient path to adapt LLMs to dynamic coding environments and new programming ecosystems.

Abstract

Contemporary Large Language Models (LLMs) exhibit a high degree of code generation and comprehension capability. A particularly promising area is their ability to interpret code modules from unfamiliar libraries for solving user-instructed tasks. Recent work has shown that large proprietary LLMs can learn novel library usage in-context from demonstrations. These results raise several open questions, such as whether demonstrations of library usage are required and whether smaller (and more open) models also possess such capabilities. In this work, we take a broader approach by systematically evaluating a diverse array of LLMs across three scenarios reflecting varying levels of domain specialization to understand their abilities and limitations in generating code based on libraries defined in-context. Our results show that even smaller open-source LLMs like Llama-2 and StarCoder demonstrate an adept understanding of novel code libraries based on specifications presented in-context. Our findings further reveal that LLMs exhibit a surprisingly high proficiency in learning novel library modules even when provided with just natural language descriptions or raw code implementations of the functions, which are often cheaper to obtain than demonstrations. Overall, our results pave the way for harnessing LLMs in more adaptable and dynamic coding environments.
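To make the three kinds of in-context supervision concrete, the following is a minimal sketch of how a single library function might be presented to a model as a description, as a raw implementation, or as a demonstration. Everything here is an assumption made for illustration: the `LibraryFunction` container, the `render_spec` and `build_prompt` helpers, the `count_items` function, and the prompt layout are hypothetical and are not taken from the paper or from any of the libraries it evaluates.

```python
from dataclasses import dataclass


@dataclass
class LibraryFunction:
    """One function of a hypothetical library that is specified in-context."""
    name: str
    signature: str
    description: str      # natural language description of the function
    implementation: str   # raw source code of the function
    demonstration: str    # a worked example: task followed by a program


def render_spec(fn: LibraryFunction, mode: str) -> str:
    """Render one function under exactly one supervision type."""
    if mode == "description":
        return f"{fn.signature}\n    # {fn.description}"
    if mode == "implementation":
        return fn.implementation
    if mode == "demonstration":
        return fn.demonstration
    raise ValueError(f"unknown supervision mode: {mode}")


def build_prompt(functions, mode: str, task: str) -> str:
    """Assemble a prompt: library specification first, then the new task."""
    specs = "\n\n".join(render_spec(fn, mode) for fn in functions)
    return f"Library ({mode}):\n{specs}\n\nTask: {task}\nProgram:"


# A hypothetical library function, used only to illustrate the three modes.
COUNT_ITEMS = LibraryFunction(
    name="count_items",
    signature="count_items(items, label)",
    description="Return how many entries of items equal label.",
    implementation=(
        "def count_items(items, label):\n"
        "    return sum(1 for item in items if item == label)"
    ),
    demonstration=(
        "Task: How many cats are in the list?\n"
        "Program: answer = count_items(objects, 'cat')"
    ),
)

if __name__ == "__main__":
    for mode in ("description", "implementation", "demonstration"):
        print(build_prompt([COUNT_ITEMS], mode, "How many dogs are in the list?"))
        print("-" * 40)
```

Under this sketch, the description and implementation modes only need the library's documentation or source, whereas the demonstration mode needs a solved task for each function, which is one way to read the abstract's remark that descriptions and implementations are often cheaper to obtain.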

Paper Structure (56 sections, 15 figures, 8 tables)

Introduction
Related Work
Evaluating Code Generation in LLMs.
API and Tool Use.
Learning Novel Libraries
Experimental Setup
Tasks.
Library.
In-context supervision.
Models.
Metrics.
Results
Models learn to use new libraries.
Models can learn from description and code.
Effect of pretraining.
...and 41 more sections

Figures (15)

Figure 1: Illustration of the three types of in-context supervision we use to specify library functions. The examples in this figure are from the GQA dataset and the functions are from the VisProg library.
Figure 2: Illustration of aliasing the function names in VisProg with synonymous words.
Figure 3: Performance ($\uparrow$) of models at solving NL2Python programming problems in our curated dataset using functions defined in-context.
Figure 4: An example of the automated theorem proving task using the Isabelle language.
Figure 5: Illustration of aliasing the Isabelle language.
...and 10 more figures
