TypeEvalPy: A Micro-benchmarking Framework for Python Type Inference Tools
Ashwin Prasad Shivarpatna Venkatesh, Samkutty Sabu, Jiawei Wang, Amir M. Mir, Li Li, Eric Bodden
TL;DR
TypeEvalPy addresses the need for a standardized, reproducible way to evaluate Python type inference tools. It provides a containerized micro-benchmark covering Python 3.10 constructs and a three-part pipeline (Runner, Translator, Result Analyzer) that executes the tools, standardizes their output, and analyzes the results. The framework evaluates six tools across 18 feature categories against a ground-truth benchmark of 154 code snippets with 845 type annotations, reporting metrics such as exact-match rate, precision, recall, soundness, completeness, and Top-$n$ predictions. In the experiments, HeaderGen is the most reliable tool overall, Jedi and Pyright excel in builtins and external contexts, and HiTyper-DL outperforms HiTyper, which illustrates the promise of hybrid static-ML approaches; Type4Py lags due to its training-based limitations. The study indicates that leveraging external type stubs and combining static analysis with ML can improve performance, while soundness and completeness remain challenging in Python type inference.
Abstract
In light of the growing interest in type inference research for Python, both researchers and practitioners require a standardized process to assess the performance of various type inference techniques. This paper introduces TypeEvalPy, a comprehensive micro-benchmarking framework for evaluating type inference tools. TypeEvalPy contains 154 code snippets with 845 type annotations across 18 categories that target various Python features. The framework manages the execution of containerized tools, transforms inferred types into a standardized format, and produces meaningful metrics for assessment. Through our analysis, we compare the performance of six type inference tools, highlighting their strengths and limitations. Our findings provide a foundation for further research and optimization in the domain of Python type inference.
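To make the notion of comparing standardized tool output against ground truth concrete, the following is a minimal sketch and not TypeEvalPy's actual implementation. It assumes a hypothetical record format in which each annotation is keyed by `(file, line, name)` and maps to a set of type names, and it assumes that precision and recall are computed from exact matches over those records. The helper names `exact_matches` and `precision_recall` and the sample data are illustrative only.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical key for one annotation: (file, line number, identifier name).
Key = Tuple[str, int, str]
Annotations = Dict[Key, FrozenSet[str]]


def exact_matches(truth: Annotations, inferred: Annotations) -> int:
    """Count ground-truth annotations whose inferred type set is identical."""
    return sum(1 for key, types in truth.items() if inferred.get(key) == types)


def precision_recall(truth: Annotations, inferred: Annotations) -> Tuple[float, float]:
    """Assumed definitions: exact matches over reported / over expected annotations."""
    matched = exact_matches(truth, inferred)
    precision = matched / len(inferred) if inferred else 0.0
    recall = matched / len(truth) if truth else 0.0
    return precision, recall


# Illustrative ground truth for one snippet (made-up data).
truth: Annotations = {
    ("main.py", 1, "a"): frozenset({"int"}),
    ("main.py", 2, "b"): frozenset({"str"}),
    ("main.py", 4, "f"): frozenset({"int", "str"}),
    ("main.py", 7, "c"): frozenset({"list"}),
}

# Illustrative tool output: one wrong type, one missing annotation.
inferred: Annotations = {
    ("main.py", 1, "a"): frozenset({"int"}),
    ("main.py", 2, "b"): frozenset({"bytes"}),
    ("main.py", 4, "f"): frozenset({"str", "int"}),
}

if __name__ == "__main__":
    precision, recall = precision_recall(truth, inferred)
    print(exact_matches(truth, inferred), precision, recall)
```

Representing each annotation's types as a set means a union such as `int | str` matches regardless of the order in which a tool reports its members; whether the framework treats unions this way is an assumption of this sketch.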
