Detecting and Fixing API Misuses of Data Science Libraries Using Large Language Models

Akalanka Galappaththi, Francisco Ribeiro, Sarah Nadi

TL;DR

This work tackles API misuses in data-science libraries by introducing DSChecker, an LLM-based approach that leverages both API directives and dynamic data information to detect and fix misuses. It demonstrates that providing structured, directive-aware prompts and data context significantly boosts performance across multiple LLMs, with the best zero-shot configuration achieving strong detection and repair outcomes. An agentic variant, DSChecker_agent, investigates real-world applicability by enabling on-demand information retrieval, showing feasibility though with some performance trade-offs. The study extends to other data-centric libraries and compares with existing LLM-based misuse detectors, highlighting DSChecker's superior detection/fix rates in many settings and outlining practical challenges and future directions for LLM-driven tooling in software libraries.

Abstract

Data science libraries, such as scikit-learn and pandas, specialize in processing and manipulating data. The data-centric nature of these libraries makes the detection of API misuse in them more challenging. This paper introduces DSChecker, an LLM-based approach designed for detecting and fixing API misuses of data science libraries. We identify two key pieces of information, API directives and data information, that may be beneficial for API misuse detection and fixing. Using three LLMs and misuses from five data science libraries, we experiment with various prompts. We find that incorporating API directives and data-specific details enhances DSChecker's ability to detect and fix API misuses, with the best-performing model achieving a detection F1-score of 61.18% and fixing 51.28% of the misuses. Building on these results, we implement DSChecker_agent, which includes an adaptive function-calling mechanism to access information on demand, simulating a real-world setting where information about the misuse is unknown in advance. We find that DSChecker_agent achieves a detection F1-score of 48.65% and fixes 39.47% of the misuses, demonstrating the promise of LLM-based API misuse detection and fixing in real-world scenarios.

Paper Structure

This paper contains 50 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: SimpleImputer misuse dc-misuse-smr:24: strategy="mean" drops column B as it contains only NaN values, causing an indexing error when accessed (from Stack Overflow https://stackoverflow.com/questions/60527883/does-simpleimputer-remove-features).
  • Figure 2: Example of prompt provided to an LLM to detect and fix the misuse in Figure 1.
  • Figure 3: Effect of adding data to data-dependent misuses vs. non-data-dependent misuses.
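
The data-dependent misuse described in the Figure 1 caption can be reproduced with a minimal sketch. The DataFrame, the helper names `impute_naive` and `impute_safe`, and the chosen fix are illustrative assumptions rather than the paper's code or its repair. The sketch also assumes scikit-learn's default behavior of discarding features with no observed values when `strategy="mean"`; this may differ across versions or when `keep_empty_features=True` is set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: column B has no observed values at all.
df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [np.nan, np.nan, np.nan]})


def impute_naive(frame):
    """Misuse: the caller assumes the output has as many columns as the input."""
    return SimpleImputer(strategy="mean").fit_transform(frame)


def impute_safe(frame):
    """One possible fix (an assumption, not the paper's repair):
    drop all-NaN columns first so column names and values stay aligned."""
    observed = frame.dropna(axis=1, how="all")
    values = SimpleImputer(strategy="mean").fit_transform(observed)
    return pd.DataFrame(values, columns=observed.columns, index=observed.index)


naive = impute_naive(df)
print(df.shape, naive.shape)  # the imputed array has lost the all-NaN column

try:
    naive[:, 1]  # column B no longer exists in the output
except IndexError as exc:
    print("IndexError:", exc)

safe = impute_safe(df)
print(list(safe.columns))
```

Whether this call is a misuse depends on the data rather than on the code alone: the same call is harmless when every column has at least one observed value. That is why the abstract treats data information as an input to detection.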