PoTo: A Hybrid Andersen's Points-to Analysis for Python

Ingkarat Rak-amnouykit, Ana Milanova, Guillaume Baudart, Martin Hirzel, Julian Dolby

TL;DR

PoTo introduces an Andersen-style, flow-insensitive and context-insensitive points-to analysis tailored for Python. It handles dynamic language features and external libraries through a hybrid approach that combines static analysis with concrete evaluation. The analysis runs as a two-phase pipeline (Python source to 3-address code, then 3-address code to a points-to graph), and a client analysis, PoTo+, derives concrete-like types from the points-to graph. Evaluated against Pytype and neural baselines on ten real-world packages, PoTo+ achieves strong coverage and generally matches or exceeds static baselines in accuracy while scaling better than Pytype. The work demonstrates that static points-to analysis augmented with concrete evaluation can support scalable type inference and program understanding for large Python codebases.

Abstract

As Python is increasingly adopted for large and complex programs, the importance of static analysis for Python (such as type inference) grows. Unfortunately, static analysis for Python remains challenging due to the language's dynamic features and its abundant external libraries. To help fill this gap, this paper presents PoTo, an Andersen-style context-insensitive and flow-insensitive points-to analysis for Python. PoTo addresses Python-specific challenges and works for large programs via a novel hybrid evaluation, integrating traditional static points-to analysis with concrete evaluation in the Python interpreter for external library calls. Next, this paper presents PoTo+, a static type inference for Python built on the points-to analysis. We evaluate PoTo+ and compare it to two state-of-the-art Python type inference techniques: (1) the static rule-based Pytype and (2) the deep-learning-based DLInfer. Our results show that PoTo+ outperforms both Pytype and DLInfer on existing Python packages.
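
To make the idea of an Andersen-style, flow-insensitive analysis with concrete evaluation more tangible, the following is a minimal sketch and not PoTo's implementation. The tuple-based statement format, the `solve` function, the `"ext"` statement kind, and the `("concrete", type_name)` abstract object are all assumptions introduced here for illustration; the paper's actual 3-address code and its handling of external calls may differ.

```python
import math
from collections import defaultdict

# Hypothetical 3-address statements (illustrative only, not PoTo's IR):
#   ("new",   x, site)   x = <object allocated at site>
#   ("copy",  x, y)      x = y
#   ("store", x, f, y)   x.f = y
#   ("load",  x, y, f)   x = y.f
#   ("ext",   x, thunk)  x = <external call>, evaluated concretely

def solve(statements):
    pts = defaultdict(set)   # variable -> set of abstract objects
    heap = defaultdict(set)  # (abstract object, field) -> set of abstract objects

    def flow(target, values):
        # Add values to target in place; report whether target grew.
        before = len(target)
        target |= values
        return len(target) != before

    changed = True
    while changed:  # fixed point: statement order does not matter
        changed = False
        for st in statements:
            op = st[0]
            if op == "new":
                changed |= flow(pts[st[1]], {st[2]})
            elif op == "copy":
                changed |= flow(pts[st[1]], pts[st[2]])
            elif op == "store":
                _, x, f, y = st
                for obj in list(pts[x]):
                    changed |= flow(heap[(obj, f)], pts[y])
            elif op == "load":
                _, x, y, f = st
                for obj in list(pts[y]):
                    changed |= flow(pts[x], heap[(obj, f)])
            elif op == "ext":
                # Assumption: run the external call in the interpreter and
                # keep only the result's type name as the abstract object.
                result = st[2]()
                changed |= flow(pts[st[1]], {("concrete", type(result).__name__)})
    return pts, heap

program = [
    ("copy", "b", "a"),                    # b = a (listed first on purpose)
    ("new", "a", "A@1"),                   # a = A()
    ("new", "v", "V@2"),                   # v = V()
    ("store", "a", "f", "v"),              # a.f = v
    ("load", "c", "b", "f"),               # c = b.f
    ("ext", "r", lambda: math.sqrt(4.0)),  # r = math.sqrt(4.0)
]
pts, heap = solve(program)
```

Because the solver iterates to a fixed point, `b` still receives the object allocated for `a` even though the copy appears before the allocation, which is what flow-insensitivity means in this setting. The external call is re-evaluated on every pass, so this sketch is only reasonable for side-effect-free calls.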

Paper Structure (29 sections, 3 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Illustrating example, adapted from DLInfer (Yan et al., 2023).
  • Figure 2: 3-address code statements.
  • Figure 3: Syntax of a subset of Python.
  • Figure 4: From Python to 3-address-code. Given an environment $\Gamma$, the interpretation function for a statement $\mathcal{I}(s, \Gamma) = (\Gamma', S)$ (left) returns an updated environment $\Gamma'$ and the 3-address code $S$. The interpretation function for an expression $\mathcal{I}(e, \Gamma) = (V, S)$ (right) returns a set of analysis variables $V$ and the 3-address code $S$. $M$ is the enclosing module, and $\Gamma_0$ is the global environment.
  • Figure 5: Coverage percentages of non-empty keys to total keys (RQ1).
  • ...and 2 more figures