TypeEvalPy: A Micro-benchmarking Framework for Python Type Inference Tools

Ashwin Prasad Shivarpatna Venkatesh, Samkutty Sabu, Jiawei Wang, Amir M. Mir, Li Li, Eric Bodden

TL;DR

TypeEvalPy addresses the need for a standardized, reproducible evaluation framework for Python type inference tools by introducing a containerized micro-benchmark that covers Python 3.10 constructs and a three-part pipeline (Runner, Translator, Result Analyzer) to execute, standardize, and analyze results. The framework evaluates six tools across 18 feature categories using a ground-truth benchmark of 154 code snippets with 845 type annotations, producing metrics such as exact-match rate, precision, recall, soundness, completeness, and Top-$n$ predictions. Experimental results show HeaderGen as the most reliable overall, with Jedi and Pyright excelling in builtins/external contexts, and HiTyper-DL outperforming HiTyper, illustrating the promise of hybrid static-ML approaches; Type4Py lags due to its training-based limitations. The study highlights that leveraging external type stubs and combining static analysis with ML can improve performance, while soundness and completeness remain challenging in Python type inference.

Abstract

In light of the growing interest in type inference research for Python, both researchers and practitioners require a standardized process to assess the performance of various type inference techniques. This paper introduces TypeEvalPy, a comprehensive micro-benchmarking framework for evaluating type inference tools. TypeEvalPy contains 154 code snippets with 845 type annotations across 18 categories that target various Python features. The framework manages the execution of containerized tools, transforms inferred types into a standardized format, and produces meaningful metrics for assessment. Through our analysis, we compare the performance of six type inference tools, highlighting their strengths and limitations. Our findings provide a foundation for further research and optimization in the domain of Python type inference.
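To make the analysis step concrete, the following is a minimal sketch of how inferred types might be compared against ground-truth annotations once both are in a common format. It is not TypeEvalPy's actual implementation: the `Annotation` schema, the `evaluate` function, and the particular definitions of precision and recall used here (exact matches divided by the number of inferred facts and by the number of ground-truth facts, respectively) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One type fact in a hypothetical standardized format (assumed schema)."""
    file: str
    line: int
    name: str
    types: frozenset  # acceptable type names, e.g. frozenset({"int", "str"})


def _key(a: Annotation) -> tuple:
    # A fact is identified by where it occurs and which symbol it describes.
    return (a.file, a.line, a.name)


def evaluate(ground_truth: list, inferred: list) -> dict:
    """Count exact matches and derive precision/recall (assumed definitions)."""
    truth = {_key(a): a.types for a in ground_truth}
    pred = {_key(a): a.types for a in inferred}

    # An exact match requires the inferred type set to equal the expected one.
    exact = sum(1 for k, t in pred.items() if truth.get(k) == t)

    precision = exact / len(pred) if pred else 0.0
    recall = exact / len(truth) if truth else 0.0
    return {"exact_matches": exact, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Hypothetical data: three expected facts, two inferred, one of them wrong.
    gt = [
        Annotation("main.py", 1, "x", frozenset({"int"})),
        Annotation("main.py", 2, "y", frozenset({"str"})),
        Annotation("main.py", 4, "f", frozenset({"int", "str"})),
    ]
    out = [
        Annotation("main.py", 1, "x", frozenset({"int"})),
        Annotation("main.py", 2, "y", frozenset({"bytes"})),
    ]
    print(evaluate(gt, out))
```

Keying each fact by file, line, and symbol name keeps the comparison independent of the order in which a tool reports its results, which is one reason a translation step into a shared format is useful before any metric is computed.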
