Unmasking the Genuine Type Inference Capabilities of LLMs for Java Code Snippets

Yiwen Dong; Zhenyang Xu; Yongqiang Tian; Chengnian Sun

TL;DR

The paper asks whether LLMs genuinely infer types in Java code snippets or merely reproduce leaked training data. It introduces ThaliaType, a leakage-free benchmark, and applies semantic-preserving code transformations to test whether models understand the execution semantics of snippets, comparing the results against StatType-SO to expose the effects of data leakage. Across multiple open-weight and closed LLMs, performance drops substantially on unseen data. SnR, a constraint-based baseline, remains robust, whereas LLMs can overfit to leaked training data, especially for frequently occurring fully qualified names (FQNs). The findings underscore the need for leakage-free benchmarks and semantics-aware evaluation methods, and the authors release ThaliaType and the related code to enable rigorous, generalizable assessments of type inference capabilities.

Abstract

Type inference is crucial for reusing online code snippets. Although snippets are widely shared on platforms like StackOverflow, they often lack essential type information, such as fully qualified names (FQNs). Recent studies have leveraged Large Language Models (LLMs) to perform type inference for such code snippets, showing promising results. However, these results may suffer from data leakage, because StatType-SO, the benchmark used for evaluation, has been publicly available on GitHub since 2017. Consequently, it remains uncertain whether the strong performance of LLMs reflects genuine semantic understanding of code or is due to the ground truth being included in the training set. This paper strives to comprehensively evaluate the genuine type inference capabilities of LLMs on Java code snippets and identify potential limitations of LLMs. First, we created ThaliaType, a new, previously unreleased benchmark suite designed for type inference evaluation. Second, using the StarCoder2 LLM as a baseline, we uncovered data leakage from StatType-SO in StarCoder2's open-source training set and observed that other state-of-the-art LLMs exhibit similar performance drops when evaluated on ThaliaType, with precision decreasing by up to 59% and recall by up to 72%. Finally, we designed semantic-preserving code transformations to test the capabilities of LLMs in understanding the execution semantics of snippets. Results showed that LLMs' performance on StatType-SO was far less robust to these transformations than their performance on ThaliaType, suggesting that the performance on StatType-SO may be biased by data leakage and may have limited generalizability. These findings highlight the importance of carefully designed, leakage-free benchmarks for evaluating LLMs on type inference tasks. We recommend that future studies adopt ThaliaType for rigorous and reliable assessments of LLMs' genuine type inference capabilities.
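
To make the task concrete, the following is a minimal illustrative sketch, not the authors' evaluation harness. It shows a snippet that uses only simple names, the FQNs a tool would need to recover, and one plausible way to score a prediction with set-based precision and recall. The class name `FqnInferenceSketch`, the example snippet, and the set-based metric definitions are all assumptions made for illustration; the abstract does not specify how precision and recall are computed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names come from the paper.
public class FqnInferenceSketch {

    // A snippet as it might be posted online: simple names only, no imports.
    static final String SNIPPET =
        "List<String> names = new ArrayList<>();\n" +
        "names.add(\"a\");";

    // FQNs that a type inference tool should recover for SNIPPET.
    static final Set<String> EXPECTED =
        Set.of("java.util.List", "java.util.ArrayList");

    // Number of predicted FQNs that also appear in the ground truth.
    static int correctCount(Set<String> predicted, Set<String> expected) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(expected);
        return correct.size();
    }

    // Assumed definition: fraction of predicted FQNs that are correct.
    static double precision(Set<String> predicted, Set<String> expected) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) correctCount(predicted, expected) / predicted.size();
    }

    // Assumed definition: fraction of expected FQNs that were predicted.
    static double recall(Set<String> predicted, Set<String> expected) {
        if (expected.isEmpty()) {
            return 0.0;
        }
        return (double) correctCount(predicted, expected) / expected.size();
    }

    public static void main(String[] args) {
        // A made-up prediction containing one spurious FQN (java.awt.List).
        Set<String> predicted =
            Set.of("java.util.List", "java.util.ArrayList", "java.awt.List");

        System.out.println(SNIPPET);
        System.out.println("precision = " + precision(predicted, EXPECTED));
        System.out.println("recall    = " + recall(predicted, EXPECTED));
    }
}
```

In this made-up prediction, the spurious `java.awt.List` lowers precision while recall stays at its maximum, since both expected FQNs are present. The simple name `List` is ambiguous between `java.util.List` and `java.awt.List`, which is the kind of ambiguity that makes FQN inference non-trivial.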
