(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs
Wanqin Ma, Chenyang Yang, Christian Kästner
TL;DR
The paper studies how evolving LLM APIs cause performance regressions and affect prompt design in downstream tasks, using an exploratory toxicity-detection case study across the GPT-3.5 family. It shows that API updates and prompt choices produce non-uniform regressions: many individual predictions regress even when aggregate accuracy improves, and even when the model is confident. It argues for rethinking regression testing for LLMs by introducing data-slice regression tests, tracking both prompts and model versions, and explicitly handling non-determinism, including entropy-based uncertainty analyses to locate regression-prone regions. The work outlines an agenda for building regression-testing support for prompting LLM APIs, with practical implications for prompt versioning, monitoring, and robust evaluation during model migrations.
Abstract
Large Language Models (LLMs) are increasingly integrated into software applications. Downstream application developers often access LLMs through APIs provided as a service. However, LLM APIs are often updated silently and scheduled to be deprecated, forcing users to continuously adapt to evolving models. This can cause performance regression and affect prompt design choices, as evidenced by our case study on toxicity detection. Based on our case study, we emphasize the need for and re-examine the concept of regression testing for evolving LLM APIs. We argue that regression testing LLMs requires fundamental changes to traditional testing approaches, due to different correctness notions, prompting brittleness, and non-determinism in LLM APIs.
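To make the idea of slice-level regression testing under non-determinism concrete, the following is a minimal sketch and not the authors' implementation. It assumes that predictions from two model versions have already been collected as plain label strings; the function names `regression_report` and `response_entropy`, the slice names, and the toy data are all hypothetical. The first function counts examples that the old version classified correctly and the new version does not, per data slice; the second computes the Shannon entropy of repeated responses to a single prompt as a rough uncertainty signal.

```python
from collections import Counter, defaultdict
from math import log2


def regression_report(labels, old_preds, new_preds, slices):
    """Compare two model versions overall and per data slice.

    A "regression" is an example the old version got right and the
    new version gets wrong. The key "ALL" holds the aggregate numbers
    (assumption: no real slice is itself named "ALL").
    """
    stats = defaultdict(lambda: {"n": 0, "old_ok": 0, "new_ok": 0, "regressions": 0})
    for y, old, new, s in zip(labels, old_preds, new_preds, slices):
        for key in (s, "ALL"):
            st = stats[key]
            st["n"] += 1
            st["old_ok"] += old == y
            st["new_ok"] += new == y
            st["regressions"] += (old == y) and (new != y)
    return {
        key: {
            "n": st["n"],
            "old_acc": st["old_ok"] / st["n"],
            "new_acc": st["new_ok"] / st["n"],
            "regressions": st["regressions"],
        }
        for key, st in stats.items()
    }


def response_entropy(samples):
    """Shannon entropy (in bits) of repeated responses to one prompt."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Toy data (hypothetical): six examples in two slices.
labels = ["toxic", "toxic", "ok", "ok", "ok", "toxic"]
slices = ["slang", "slang", "formal", "formal", "formal", "formal"]
old_preds = ["toxic", "ok", "toxic", "toxic", "ok", "toxic"]
new_preds = ["ok", "toxic", "ok", "ok", "ok", "toxic"]

report = regression_report(labels, old_preds, new_preds, slices)
for name, row in report.items():
    print(name, row)

# Entropy of four repeated calls on the same prompt.
print(response_entropy(["toxic", "ok", "toxic", "ok"]))
```

In this toy data the aggregate accuracy rises from 3/6 to 5/6, yet the "slang" slice still contains one regression, which is the kind of case an aggregate-only comparison would hide. An entropy of 0 means all repeated responses agree; higher values flag prompts whose outputs vary across calls.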
