Unmasking the Genuine Type Inference Capabilities of LLMs for Java Code Snippets

Yiwen Dong, Zhenyang Xu, Yongqiang Tian, Chengnian Sun

TL;DR

The paper asks whether LLMs genuinely infer types in Java code snippets or merely reproduce leaked training data. It introduces ThaliaType, a leakage-free benchmark, and applies semantic-preserving code transformations to test whether LLMs understand the execution semantics of snippets, contrasting the results with those on StatType-SO to expose the effects of data leakage. Across multiple open-weight and closed-source LLMs, the study finds substantial performance drops on unseen data. SnR, a constraint-based baseline, remains robust, whereas LLMs can overfit to leaked training data, especially for frequently occurring FQNs. The findings underscore the need for leakage-free benchmarks and semantics-aware evaluation, and the authors release ThaliaType and the related code to enable rigorous, generalizable assessments of type inference capabilities.

Abstract

Type inference is crucial for reusing online code snippets. Although snippets are widely shared on platforms like Stack Overflow, they often lack essential type information, such as fully qualified names (FQNs). Recent studies have leveraged Large Language Models (LLMs) to perform type inference for such code snippets, showing promising results. However, these results may suffer from data leakage, as the benchmark used for evaluation, StatType-SO, has been publicly available on GitHub since 2017. Consequently, it remains uncertain whether the strong performance of LLMs reflects genuine semantic understanding of code or is due to the ground truth being included in the training set. This paper strives to comprehensively evaluate the genuine type inference capabilities of LLMs on Java code snippets and identify potential limitations of LLMs. First, we created ThaliaType, a new, previously unreleased benchmark suite designed for type inference evaluation. Second, using the StarCoder2 LLM as a baseline, we uncovered data leakage from StatType-SO in StarCoder2's open-source training set and observed that other state-of-the-art LLMs exhibit similar performance drops when evaluated on ThaliaType, with precision decreasing by up to 59% and recall by up to 72%. Finally, we designed semantic-preserving code transformations to test the capabilities of LLMs in understanding the execution semantics of snippets. Results showed that LLMs' performance on StatType-SO is far less robust to these transformations than on ThaliaType, suggesting that the performance on StatType-SO may be biased by data leakage and have limited generalizability. These findings highlight the importance of carefully designed, leakage-free benchmarks for evaluating LLMs on type inference tasks. We recommend future studies adopt ThaliaType for rigorous and reliable assessments of LLMs' genuine type inference capabilities.

Paper Structure

This paper contains 26 sections, 2 equations, 12 figures, 4 tables, 3 algorithms.

Figures (12)

  • Figure 1: The general workflow. The libraries used to collect code snippets in StatType-SO were also used to generate code snippets in ThaliaType. These code snippets are used in RQ1, and their transformed versions are used in RQ2 to evaluate various type inference techniques. Their results are compared to evaluate the genuine type inference capabilities of LLMs for Java code snippets. Red indicates potential data leakage.
  • Figure 2: An example code snippet with example output and example knowledge base used by constraint-based type inference techniques for resolving ambiguities in types.
  • Figure 3: Example prompt used to infer types on code snippets with LLMs, along with an example response. The placeholder {input_code} is substituted with the given code snippet.
  • Figure 4: Example code snippet from StatType-SO using the JodaTime and JDK libraries. Excessive newlines have been removed for clarity of presentation. These code snippets are generally short. The import statements serve as the ground truth and are removed before the code snippet is used for evaluating type inference.
  • Figure 5: Formatted code snippet generated by Thalia using the JodaTime and JDK libraries. The variable names are randomly generated. The function interfaces were generated by Thalia but are unused in ThaliaType code snippets.
  • ...and 7 more figures
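
The caption of Figure 4 notes that import statements serve as the ground truth and are removed before a snippet is used for evaluation, and the abstract reports results as precision and recall. A minimal sketch of that setup is shown below. It is an illustration only: the class `ImportEval`, its method names, and the set-based definitions of precision and recall are assumptions made here, not the authors' evaluation code, and the sketch handles only single-type `import` lines (static and wildcard imports are ignored).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of ThaliaType or StatType-SO.
public class ImportEval {

    // True for lines such as "import org.joda.time.DateTime;".
    // Static and wildcard imports are skipped to keep the sketch simple.
    private static boolean isSingleTypeImport(String line) {
        String t = line.trim();
        return t.startsWith("import ") && t.endsWith(";")
                && !t.startsWith("import static ") && !t.endsWith(".*;");
    }

    // Collects the FQNs named by the snippet's import lines (the ground truth).
    static Set<String> groundTruth(String snippet) {
        Set<String> fqns = new LinkedHashSet<>();
        for (String line : snippet.split("\n")) {
            if (isSingleTypeImport(line)) {
                String t = line.trim();
                fqns.add(t.substring("import ".length(), t.length() - 1).trim());
            }
        }
        return fqns;
    }

    // Returns the snippet with every import line removed; this is the text
    // that would be handed to the type inference technique under test.
    static String stripImports(String snippet) {
        List<String> kept = new ArrayList<>();
        for (String line : snippet.split("\n")) {
            if (!line.trim().startsWith("import ")) {
                kept.add(line);
            }
        }
        return String.join("\n", kept);
    }

    // Number of predicted FQNs that also appear in the ground truth.
    private static long correct(Set<String> predicted, Set<String> truth) {
        return predicted.stream().filter(truth::contains).count();
    }

    // Assumed definition: correct predictions / all predictions.
    static double precision(Set<String> predicted, Set<String> truth) {
        return predicted.isEmpty() ? 0.0 : (double) correct(predicted, truth) / predicted.size();
    }

    // Assumed definition: correct predictions / all ground-truth FQNs.
    static double recall(Set<String> predicted, Set<String> truth) {
        return truth.isEmpty() ? 0.0 : (double) correct(predicted, truth) / truth.size();
    }

    public static void main(String[] args) {
        String snippet = String.join("\n",
                "import org.joda.time.DateTime;",
                "import java.util.List;",
                "class Demo { DateTime now; List<String> names; }");

        Set<String> truth = groundTruth(snippet);
        // A made-up prediction: one FQN is right, one is wrong.
        Set<String> predicted = Set.of("org.joda.time.DateTime", "java.util.ArrayList");

        System.out.println(stripImports(snippet));
        System.out.println("precision = " + precision(predicted, truth));
        System.out.println("recall    = " + recall(predicted, truth));
    }
}
```

In this made-up example one of the two predicted FQNs matches one of the two ground-truth FQNs, so both metrics come out to 0.5. Treating predictions as a set per snippet is a design choice of the sketch; the paper may aggregate differently.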