LLM Interactive Optimization of Open Source Python Libraries -- Case Studies and Generalization

Andreas Florath

TL;DR

This paper explores the feasibility of using large language models (LLMs) in collaborative, human-guided optimization of open-source Python libraries, focusing on Pillow and NumPy. Through methodologically rigorous case studies, it demonstrates that ChatGPT-4 can achieve substantial runtime and energy-efficiency improvements (up to 38x in some cases) when guided by a human expert, with the most robust gains arising from iterative human-in-the-loop workflows rather than autonomous optimization. The work provides detailed transcripts, upstream pull requests, and generalization attempts across loci and even a different LLM (Google Bard), highlighting both the potential and the current limits of LLM-assisted code optimization. It emphasizes the essential role of human expertise, calls for more robust quantitative studies, and proposes pathways for community-driven replication and extension of these findings.

Abstract

With the advent of large language models (LLMs) like GPT-3, a natural question is the extent to which these models can be utilized for source code optimization. This paper presents methodologically stringent case studies applied to the well-known open source Python libraries Pillow and NumPy. We find that the contemporary LLM ChatGPT-4 (as of September and October 2023) is surprisingly adept at optimizing energy and compute efficiency. However, this is only the case in interactive use, with a human expert in the loop. Aware of experimenter bias, we document our qualitative approach in detail and provide transcripts and source code. We start with a detailed description of our approach in conversing with the LLM to optimize the _getextrema function in the Pillow library, followed by a quantitative evaluation of the performance improvement. To demonstrate qualitative replicability and generalization within and beyond a library, we report further attempts on another locus in the Pillow library and on one code locus in the NumPy library. In all attempts, the performance improvement is significant (up to a factor of 38). We have not omitted any failed attempts from this report (there were none). We conclude that LLMs are a promising tool for code optimization in open source libraries, but that the human expert in the loop is essential for success. Nonetheless, we were surprised by how few iterations were required to achieve substantial performance improvements that were not obvious to the expert in the loop. We would like to draw attention to the qualitative nature of this study: more robust quantitative studies would need to introduce a layer of selecting experts in a representative sample. We invite the community to collaborate.
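As a rough illustration of what a "performance improvement factor" means in this setting, the sketch below times two interchangeable implementations of a histogram-extrema search and reports their ratio. Both functions, the helper `speedup`, and the sample histogram are hypothetical and written only for this example; they are not taken from the Pillow source, the paper's transcripts, or its pull requests, and the measured factor will vary by machine and input.

```python
import timeit


def extrema_baseline(histogram):
    """Hypothetical baseline: visit all 256 bins, tracking min and max non-empty index."""
    lo, hi = 255, 0
    for i in range(256):
        if histogram[i]:
            lo = min(lo, i)
            hi = max(hi, i)
    return lo, hi


def extrema_early_exit(histogram):
    """Hypothetical variant: search from each end and stop at the first non-empty bin."""
    lo = next((i for i in range(256) if histogram[i]), 255)
    hi = next((i for i in range(255, -1, -1) if histogram[i]), 0)
    return lo, hi


def speedup(baseline, candidate, arg, repeat=5, number=2000):
    """Return baseline time divided by candidate time, using the best of `repeat` runs."""
    t_base = min(timeit.repeat(lambda: baseline(arg), repeat=repeat, number=number))
    t_cand = min(timeit.repeat(lambda: candidate(arg), repeat=repeat, number=number))
    return t_base / t_cand


# Assumed sample input: a 256-bin histogram with two non-empty bins.
histogram = [0] * 256
histogram[3] = 10
histogram[200] = 7

# An optimization only counts if the results stay identical.
assert extrema_baseline(histogram) == extrema_early_exit(histogram)

factor = speedup(extrema_baseline, extrema_early_exit, histogram)
print(f"speedup factor: {factor:.1f}x")
```

Checking that both versions return the same result before timing them mirrors the role of the human expert in the loop: a faster function is only useful if its behavior is unchanged.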
