Generating Java Methods: An Empirical Assessment of Four AI-Based Code Assistants

Vincenzo Corso; Leonardo Mariani; Daniela Micucci; Oliviero Riganelli

TL;DR

The paper evaluates AI-based code assistants on real-world Java methods with context dependencies, comparing Copilot, Tabnine, ChatGPT, and Bard on a dataset of 100 methods from open-source projects. The authors construct the dataset from recent GitHub commits, query each assistant with controlled inputs supplied either through IDE integrations or as direct prompts, and evaluate the generated code along several dimensions: correctness, McCabe complexity, efficiency, size, and CodeBLEU/Levenshtein similarity. Key findings: Copilot generally outperforms the others, but no tool dominates; dependencies outside the boundaries of a single class severely reduce correctness; and different assistants often produce correct solutions that the others miss, suggesting potential gains from combining them. The work contributes a public dataset, a cross-tool assessment framework, and actionable insights for improving code assistants and their integration into development practice.

Abstract

AI-based code assistants are promising tools that can facilitate and speed up code development. They exploit machine learning algorithms and natural language processing to interact with developers, suggesting code snippets (e.g., method implementations) that can be incorporated into projects. Recent studies empirically investigated the effectiveness of code assistants using simple exemplary problems (e.g., the re-implementation of well-known algorithms), which fail to capture the spectrum and nature of the tasks actually faced by developers. In this paper, we expand the knowledge in the area by comparatively assessing four popular AI-based code assistants, namely GitHub Copilot, Tabnine, ChatGPT, and Google Bard, with a dataset of 100 methods that we constructed from real-life open-source Java projects, considering a variety of cases for complexity and dependency from contextual elements. Results show that Copilot is often more accurate than other techniques, yet none of the assistants is completely subsumed by the rest of the approaches. Interestingly, the effectiveness of these solutions dramatically decreases when dealing with dependencies outside the boundaries of single classes.
