The Last Dependency Crusade: Solving Python Dependency Conflicts with LLMs

Antony Bartlett, Cynthia Liem, Annibale Panichella

TL;DR

This work tackles Python dependency conflicts by introducing PLLM, an LLM-driven, retrieval-augmented pipeline that iteratively infers and validates dependencies using Docker-based builds and error-driven feedback. The approach follows a five-stage process (infer modules and Python version, assign versions, build and validate, analyze errors, and refine with feedback) and leverages real PyPI metadata to constrain version choices. Evaluations on the HG2.9K benchmark show PLLM, particularly when using Gemma-2 9B with RAG, achieves superior fix rates compared to knowledge-graph baselines, with notable strength on complex dependency graphs and ML libraries. The results underscore the potential of hybrid, prompt-based reasoning combined with concrete runtime feedback for practical, scalable dependency resolution, while acknowledging complementary strengths of graph-based methods for system-level concerns and legacy dependencies.
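One way to picture how package-index metadata can constrain version choices is to check each model-proposed version against the list of releases that actually exist. The sketch below is illustrative only: the `constrain` helper, its fallback rule (newest release not above the proposal), and the sample release list are assumptions, not details taken from the paper. It handles purely numeric versions; real PyPI versions follow PEP 440, which the third-party `packaging.version.Version` class parses more completely.

```python
from typing import List, Optional, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn a purely numeric version such as '1.2.3' into (1, 2, 3)."""
    return tuple(int(part) for part in version.split("."))


def constrain(proposed: str, available: List[str]) -> Optional[str]:
    """Map a proposed version onto one that is actually published.

    Hypothetical rule: keep the proposal if it exists; otherwise fall back
    to the newest published version that is not above it.
    """
    if proposed in available:
        return proposed
    older = [v for v in available if parse(v) <= parse(proposed)]
    return max(older, key=parse) if older else None


# Made-up release list standing in for metadata fetched from a package index.
releases = ["1.0.0", "1.1.0", "1.2.3", "2.0.0"]
```

With this list, a proposal that does not exist (for example `1.5.0`) is mapped to an existing release rather than being passed on to a build that is certain to fail.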

Abstract

Resolving Python dependency issues remains a tedious and error-prone process, forcing developers to manually trial compatible module versions and interpreter configurations. Existing automated solutions, such as knowledge-graph-based and database-driven methods, face limitations due to the variety of dependency error types, large sets of possible module versions, and conflicts among transitive dependencies. This paper investigates the use of Large Language Models (LLMs) to automatically repair dependency issues in Python programs. We propose PLLM (pronounced "plum"), a novel retrieval-augmented generation (RAG) approach that iteratively infers missing or incorrect dependencies. PLLM builds a test environment where the LLM proposes module combinations, observes execution feedback, and refines its predictions using natural language processing (NLP) to parse error messages. We evaluate PLLM on the Gistable HG2.9K dataset, a curated collection of real-world Python programs. Using this benchmark, we explore multiple PLLM configurations, including six open-source LLMs evaluated both with and without RAG. Our findings show that RAG consistently improves fix rates, with the best performance achieved by Gemma-2 9B when combined with RAG. Compared to two state-of-the-art baselines, PyEGo and ReadPyE, PLLM achieves significantly higher fix rates: +15.97% more than ReadPyE and +21.58% more than PyEGo. Further analysis shows that PLLM is especially effective for projects with numerous dependencies and those using specialized numerical or machine-learning libraries.
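The propose, observe, refine cycle described in the abstract can be sketched as a small loop. This is a minimal illustration under stated assumptions, not PLLM's implementation: `repair_loop`, `missing_module`, `fake_validate`, and `fake_refine` are hypothetical names, the validator is a stub standing in for a Docker build and run, the refiner is a stub standing in for the LLM, and a single regular expression stands in for the paper's error-message parsing.

```python
import re
from typing import Callable, Dict, Optional, Tuple

Spec = Dict[str, str]                            # module name -> pinned version
Validator = Callable[[Spec], Tuple[bool, str]]   # returns (success, error log)
Refiner = Callable[[Spec, str], Spec]            # new spec from old spec + log

# Matches both "No module named 'yaml'" and the unquoted older form.
MISSING = re.compile(r"No module named '?([\w.]+)'?")


def missing_module(log: str) -> Optional[str]:
    """Return the top-level name of a module reported missing, if any."""
    match = MISSING.search(log)
    return match.group(1).split(".")[0] if match else None


def repair_loop(initial: Spec, validate: Validator, refine: Refiner,
                max_iters: int = 5) -> Optional[Spec]:
    """Validate a spec and refine it from the error log until it passes."""
    spec = dict(initial)
    for _ in range(max_iters):
        ok, log = validate(spec)
        if ok:
            return spec
        spec = refine(spec, log)
    return None  # iteration budget exhausted without a working spec


def fake_validate(spec: Spec) -> Tuple[bool, str]:
    """Stub for a container build: passes once 'requests' is listed."""
    if "requests" not in spec:
        return False, "ModuleNotFoundError: No module named 'requests'"
    return True, ""


def fake_refine(spec: Spec, log: str) -> Spec:
    """Stub for the LLM: adds the missing module with a placeholder version."""
    updated = dict(spec)
    name = missing_module(log)
    if name is not None:
        updated[name] = "2.31.0"  # placeholder pin chosen for the example
    return updated
```

Calling `repair_loop({}, fake_validate, fake_refine)` runs one failed validation, one refinement, and one successful validation. Passing the validator and refiner in as callables keeps the loop independent of any particular container runtime or model.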

Paper Structure

This paper contains 24 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of PLLM, which encompasses five main stages: (A) extracting import statements from the input Python file, (B) prompting the model to infer modules and Python versions, (C) generating candidate dependency fixes using PyPI metadata, (D) validating candidate fixes through Docker build and execution, and (E) providing error-based feedback to the LLM for iterative refinement.
  • Figure 2: Cumulative chart displaying the number of fixes found by PLLM across multiple runs.
  • Figure 3: Intersections and difference sets for the projects successfully fixed by PLLM, PyEGo, and ReadPyE.
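Stage (A) in Figure 1, extracting import statements from the input file, can be approximated with Python's standard `ast` module. The sketch below is an assumption about how such a step might look, not the extraction code used by PLLM; `top_level_imports` is a hypothetical helper name.

```python
import ast
from typing import Set


def top_level_imports(source: str) -> Set[str]:
    """Collect the top-level names of all absolute imports in the source."""
    tree = ast.parse(source)
    found: Set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import os.path" contributes "os"
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import local"
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found


sample = (
    "import numpy as np\n"
    "from sklearn.model_selection import train_test_split\n"
    "from . import local\n"
    "import os.path\n"
)
modules = top_level_imports(sample)
```

Two caveats apply to this sketch. Source written for a different Python major version may fail to parse, so a text-based fallback could be needed. Import names also do not always match installable package names (for example, `sklearn` is installed as `scikit-learn` and `yaml` as `PyYAML`), which is part of what a later inference step has to resolve.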