A Systematic Evaluation of Large Code Models in API Suggestion: When, Which, and How

Chaozheng Wang; Shuzheng Gao; Cuiyun Gao; Wenxuan Wang; Chun Yong Chong; Shan Gao; Michael R. Lyu

TL;DR

This paper systematically evaluates large code models on the API suggestion task across three scenarios (when, which, and how to use an API) and derives implications for researchers and developers that can guide future work on the task.

Abstract

API suggestion is a critical task in modern software development, assisting programmers by predicting and recommending third-party APIs based on the current context. Recent advancements in large code models (LCMs) have shown promise in the API suggestion task. However, existing studies mainly focus on suggesting which APIs to use, ignoring that programmers may need further assistance in practice, including when to use the suggested APIs and how to use them. To bridge this gap, we conduct a systematic evaluation of LCMs for the API suggestion task in this paper. To facilitate our investigation, we first build a benchmark that contains a diverse collection of code snippets, covering 176 APIs used in 853 popular Java projects. Three distinct scenarios in the API suggestion task are then considered for evaluation: (1) "when to use", which aims at determining the desired position and timing for API usage; (2) "which to use", which aims at identifying the appropriate API from a given library; and (3) "how to use", which aims at predicting the arguments for a given API. Considering all three scenarios allows for a comprehensive assessment of LCMs' capabilities in suggesting APIs for developers. For the evaluation, we choose nine popular LCMs with varying model sizes and test them in the three scenarios. We also perform an in-depth analysis of the influence of context selection on the model performance ...
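To make the three scenarios concrete, the following is a minimal sketch of how a single Java statement containing an API call could be split into a prompt and an expected completion for each scenario. It is an illustration only, not the paper's benchmark construction: the class name `ApiSuggestionScenarios`, the three helper methods, and the choice of `StringUtils.join` as the example call are all assumptions made here.

```java
public class ApiSuggestionScenarios {

    // "When to use" (hypothetical framing): the model sees the code up to the
    // point where the call starts and should decide that the library is invoked here.
    // Returns {prompt, expected completion}.
    static String[] whenToUse(String context, String statement, String receiver) {
        int callStart = statement.indexOf(receiver + ".");
        return new String[] { context + statement.substring(0, callStart), receiver };
    }

    // "Which to use": the library (receiver) is given; the model should name the API.
    static String[] whichToUse(String context, String statement, String receiver) {
        int nameStart = statement.indexOf(receiver + ".") + receiver.length() + 1;
        int openParen = statement.indexOf('(', nameStart);
        return new String[] {
            context + statement.substring(0, nameStart),
            statement.substring(nameStart, openParen)
        };
    }

    // "How to use": the API is given; the model should fill in the arguments.
    static String[] howToUse(String context, String statement, String receiver) {
        int callStart = statement.indexOf(receiver + ".");
        int openParen = statement.indexOf('(', callStart);
        int closeParen = statement.lastIndexOf(')');
        return new String[] {
            context + statement.substring(0, openParen + 1),
            statement.substring(openParen + 1, closeParen)
        };
    }

    public static void main(String[] args) {
        // Assumed example; the statement is treated purely as text.
        String context = "List<String> parts = loadParts();\n";
        String statement = "String joined = StringUtils.join(parts, \",\");";

        String[][] cases = {
            whenToUse(context, statement, "StringUtils"),
            whichToUse(context, statement, "StringUtils"),
            howToUse(context, statement, "StringUtils")
        };
        String[] names = { "when to use", "which to use", "how to use" };
        for (int i = 0; i < cases.length; i++) {
            System.out.println("[" + names[i] + "] expected completion: " + cases[i][1]);
        }
    }
}
```

In this sketch the same statement yields three progressively longer prompts, which mirrors the idea that each scenario gives the model more of the call and asks for a different missing piece.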

Paper Structure (37 sections, 7 figures, 4 tables)


Introduction
Overview of Methodology
Benchmark Preparation
Data Collection
Benchmark Construction
Context Types
Research Questions
RQ1: How do different LCMs perform in the three scenarios of API suggestion?
RQ2: How do different types of contexts affect LCMs' performance in API suggestion?
RQ3: How do contexts affect the token length and throughput of LCMs?
Experiment Setup
Selected LCMs
Evaluation Metrics
Exact Match
API Usage Accuracy
...and 22 more sections

Figures (7)

Figure 1: Three distinct scenarios in the API suggestion task.
Figure 2: The illustration of different context types.
Figure 3: Average token length, throughput, and the ratio between performance and throughput.
Figure 4: Performance and token numbers of different contexts. "L" denotes the number of lines included in the file context.
Figure 5: Case study in the "how to use" scenario, where the experimented LCM is CodeLlama 7B.
...and 2 more figures
