(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs

Wanqin Ma, Chenyang Yang, Christian Kästner

TL;DR

The paper studies how evolving LLM APIs cause performance regressions in downstream tasks and change which prompt designs work best, using an exploratory toxicity-detection case study across the GPT-3.5 family. It shows that API updates and prompt choices produce non-uniform regressions: many regressions occur even when aggregate accuracy improves, and even when the model is confident. It argues for rethinking regression testing for LLMs by introducing data-slice regression tests, tracking both prompts and model versions, and explicitly handling non-determinism, including entropy-based uncertainty analyses to locate regression-prone regions. The work outlines a concrete agenda for building regression-testing support for prompting LLM APIs, with practical implications for prompt versioning, monitoring, and robust evaluation during model migrations.

Abstract

Large Language Models (LLMs) are increasingly integrated into software applications. Downstream application developers often access LLMs through APIs provided as a service. However, LLM APIs are often updated silently and scheduled to be deprecated, forcing users to continuously adapt to evolving models. This can cause performance regression and affect prompt design choices, as evidenced by our case study on toxicity detection. Based on our case study, we emphasize the need for and re-examine the concept of regression testing for evolving LLM APIs. We argue that regression testing LLMs requires fundamental changes to traditional testing approaches, due to different correctness notions, prompting brittleness, and non-determinism in LLM APIs.
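As a rough illustration of the non-determinism point, one way to quantify how unstable an API's answer is for a single input is to query it several times and compute the Shannon entropy of the returned labels. The sketch below assumes the repeated responses have already been collected into a list; the function name `label_entropy` and the label strings are hypothetical, and the paper's own entropy formulation may differ from this standard definition.

```python
import math
from collections import Counter


def label_entropy(samples):
    """Shannon entropy (in bits) of the labels returned by repeated queries.

    `samples` is a list of labels obtained by sending the same prompt and
    input to the API several times (an assumption of this sketch).
    0.0 means every call agreed; higher values mean less stable output.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    n = len(samples)
    # Sum p * log2(1/p) over the observed labels.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Hypothetical responses for two inputs, four calls each.
stable = ["toxic", "toxic", "toxic", "toxic"]
unstable = ["toxic", "non-toxic", "toxic", "non-toxic"]

stable_entropy = label_entropy(stable)
unstable_entropy = label_entropy(unstable)
```

Inputs with high entropy could be flagged for closer inspection before and after a model migration, although the summary above notes that regressions also occur where the model appears confident, so low entropy alone would not rule a regression out.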

Paper Structure (22 sections, 1 equation, 3 figures, 3 tables)

Figures (3)

  • Figure 1: An LLM API update from text-davinci-003 to gpt-3.5-turbo-instruct causes a major performance downgrade on classifying toxic comments. The API update also changes the prompt choice: Prompt A (left) now outperforms Prompt B (right) by 8.7% accuracy.
  • Figure 2: Prompt templates for our experiments on the GitHub discussion dataset. Templates for the Civil Comments dataset are similar with some adaptations.
  • Figure 3: Regressions disproportionally happen when the toxicity relates to politics (25.7% vs. 33.3%), targets code (21.6% vs. 33.3%), or is severe (54.1% vs. 66.7%).
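The slice-level view behind Figure 3 and the data-slice regression tests mentioned in the TL;DR can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the record fields (`label`, `slice`), the function name `regression_report`, and the toy data are all hypothetical. A regression here means an example the old model version classified correctly and the new version gets wrong.

```python
from collections import defaultdict


def regression_report(examples, old_preds, new_preds):
    """Compare two model versions overall and per data slice.

    `examples` is a list of dicts with hypothetical keys 'label' and 'slice'.
    `old_preds` and `new_preds` are the predictions of the old and new
    model versions for the same examples, in the same order.
    """
    per_slice = defaultdict(lambda: {"total": 0, "regressed": 0})
    old_correct = 0
    new_correct = 0
    for ex, old, new in zip(examples, old_preds, new_preds):
        old_ok = old == ex["label"]
        new_ok = new == ex["label"]
        old_correct += old_ok
        new_correct += new_ok
        stats = per_slice[ex["slice"]]
        stats["total"] += 1
        if old_ok and not new_ok:
            stats["regressed"] += 1  # correct before the update, wrong after
    n = len(examples)
    return {
        "old_accuracy": old_correct / n,
        "new_accuracy": new_correct / n,
        "regression_rate_by_slice": {
            name: s["regressed"] / s["total"] for name, s in per_slice.items()
        },
    }


# Toy data: labels use 1 = toxic, 0 = non-toxic.
examples = [
    {"label": 1, "slice": "politics"},
    {"label": 1, "slice": "politics"},
    {"label": 0, "slice": "other"},
    {"label": 0, "slice": "other"},
    {"label": 1, "slice": "other"},
    {"label": 0, "slice": "other"},
]
old_preds = [1, 0, 1, 1, 1, 0]
new_preds = [0, 1, 0, 0, 1, 0]

report = regression_report(examples, old_preds, new_preds)
```

In this toy data the new version has higher aggregate accuracy, yet half of the "politics" slice regresses. An aggregate-only check would miss this pattern, which is why a slice-level test could assert a maximum regression rate per slice in addition to overall accuracy.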