
The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks

Ashwin Prasad Shivarpatna Venkatesh, Samkutty Sabu, Amir M. Mir, Sofia Reis, Eric Bodden

TL;DR

The paper investigates whether Large Language Models can meaningfully aid static analysis tasks in Python by evaluating 26 models on micro-benchmarks for callgraph analysis (PyCG, HeaderGen) and type inference (TypeEvalPy). It finds that LLMs substantially improve type inference relative to traditional static analyses, with GPT-4 leading untuned models and fine-tuned GPT-3.5 Turbo closing the gap; however, callgraph analysis remains better served by traditional SA methods, even with LLM assistance. The study highlights practical considerations such as compute and cost, privacy concerns, and the potential of open-source LLMs and model compression to enable local deployment. These results offer a baseline and roadmap for integrating LLMs into static analysis workflows and point to task-specific fine-tuning as a key lever. Overall, the work delineates a path toward leveraging LLMs for SA tasks while acknowledging current limitations and deployment constraints.

Abstract

The application of Large Language Models (LLMs) in software engineering, particularly in static analysis tasks, represents a paradigm shift in the field. In this paper, we investigate the role that current LLMs can play in improving callgraph analysis and type inference for Python programs. Using the PyCG, HeaderGen, and TypeEvalPy micro-benchmarks, we evaluate 26 LLMs, including OpenAI's GPT series and open-source models such as LLaMA. Our study reveals that LLMs show promising results in type inference, demonstrating higher accuracy than traditional methods, yet they exhibit limitations in callgraph analysis. This contrast emphasizes the need for specialized fine-tuning of LLMs to better suit specific static analysis tasks. Our findings provide a foundation for further research towards integrating LLMs for static analysis tasks.
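To make the two tasks concrete, the following is a minimal sketch of what a micro-benchmark case might look like: a small Python snippet, a set of expected call-graph and type facts, and an exact-match score. The fact format, the key names such as `main:x`, the helper functions, and the "predicted" type answers are all hypothetical and chosen for illustration; they are not the actual schemas or scoring used by PyCG, HeaderGen, or TypeEvalPy.

```python
import ast

# A tiny program of the kind a micro-benchmark case might contain (illustrative).
SNIPPET = '''
def helper():
    return 1

def main():
    x = helper()
    return str(x)
'''

def direct_calls(source):
    """Map each top-level function to the plain names it calls directly."""
    tree = ast.parse(source)
    graph = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            for sub in ast.walk(node):
                # Only simple calls like f(); attribute calls are ignored here.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    callees.add(sub.func.id)
            graph[node.name] = callees
    return graph

def exact_match_rate(predicted, expected):
    """Fraction of expected facts whose predicted value matches exactly."""
    if not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)

# Hypothetical ground truth for the snippet above.
EXPECTED_CALLS = {"helper": set(), "main": {"helper", "str"}}
EXPECTED_TYPES = {"helper:return": "int", "main:x": "int", "main:return": "str"}

# Hypothetical answer from a type-inference tool or model (one fact is wrong).
PREDICTED_TYPES = {"helper:return": "int", "main:x": "int", "main:return": "int"}

call_graph = direct_calls(SNIPPET)
call_score = exact_match_rate(call_graph, EXPECTED_CALLS)
type_score = exact_match_rate(PREDICTED_TYPES, EXPECTED_TYPES)
```

In this sketch the call graph comes from a simple syntactic pass, while the type facts stand in for whatever a tool or model would return; comparing both against ground truth with the same exact-match function is one simple way to put the two tasks on a common footing.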

Paper Structure (11 sections, 2 tables)