RETAIN: Interactive Tool for Regression Testing Guided LLM Migration

Tanay Dixit, Daniel Lee, Sally Fang, Sai Sree Harsha, Anirudh Sureshan, Akash Maharaj, Yunyao Li

TL;DR

RETAIN (REgression Testing guided LLM migrAtIoN) is a tool designed explicitly for regression testing in LLM migrations. Compared to manual evaluation, it enabled participants to identify twice as many errors, experiment with 75% more prompts, and achieve 12% higher metric scores in a given time frame.

Abstract

Large Language Models (LLMs) are increasingly integrated into diverse applications. The rapid evolution of LLMs presents opportunities for developers to enhance applications continuously. However, this constant adaptation can also lead to performance regressions during model migrations. While several interactive tools have been proposed to streamline the complexity of prompt engineering, few address the specific requirements of regression testing for LLM migrations. To bridge this gap, we introduce RETAIN (REgression Testing guided LLM migrAtIoN), a tool designed explicitly for regression testing in LLM migrations. RETAIN comprises two key components: an interactive interface tailored to regression testing needs during LLM migrations, and an error discovery module that facilitates understanding of differences in model behaviors. The error discovery module generates textual descriptions of various errors or differences between model outputs, providing actionable insights for prompt refinement. Our automatic evaluation and empirical user studies demonstrate that RETAIN, when compared to manual evaluation, enabled participants to identify twice as many errors, experiment with 75% more prompts, and achieve 12% higher metric scores in a given time frame.

Paper Structure (30 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Regression Testing for Prompting LLMs. The process involves: (1) input dataset, (2) initial prompt, (3) data slicing algorithm to identify behavioral differences (regressions) across models, and (4) prompt refinement to address identified regressions.
  • Figure 2: RETAIN comprises three main panels: the Metric Panel, the Data Panel, and the Error Analysis Panel. It features three pages (A) designed for various prompt engineering tasks. Users can (B) set metrics, (C) compare model outputs through charts and side-by-side comparisons, and (D) conduct in-depth analysis of failure cases using the error discovery module. Additionally, users can define LLM assertions to evaluate outputs across different prompts (E).
  • Figure 3: Error Discovery Module Interaction. (A) Users initiate error generation to identify discrepancies among model outputs in the side-by-side comparison table. (B) For errors of interest, users can employ the support feature (D) to highlight specific model outputs containing the selected error type. (C) The thumbs up/down feature allows users to create or remove custom LLM metrics based on error descriptions.
  • Figure 4: BERTScore Progression Over Time. The solid line is the average score and the shaded region is the standard deviation.
  • Figure 5: Post-Study Psychometric Evaluation Results. The x-axis labels are simplified for readability; the full questions are available in the appendix on survey questions.
  • ...and 2 more figures
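
As a rough illustration of step (3) in Figure 1, the sketch below flags examples where a new model scores noticeably worse than the old one on the same input. It is not RETAIN's data slicing algorithm, which the text above does not specify. It assumes the simplest possible approach, a per-example score comparison. The names `token_f1`, `find_regressions`, and `min_drop` are hypothetical, and the toy token-overlap metric stands in for whatever metric a user would actually configure (for example BERTScore).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Regression:
    index: int        # position of the example in the dataset
    old_score: float  # metric score of the old model's output
    new_score: float  # metric score of the new model's output


def token_f1(reference: str, candidate: str) -> float:
    """Toy stand-in metric (assumption): F1 over unique lowercase tokens."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def find_regressions(
    references: Sequence[str],
    old_outputs: Sequence[str],
    new_outputs: Sequence[str],
    metric: Callable[[str, str], float] = token_f1,
    min_drop: float = 0.1,
) -> List[Regression]:
    """Return examples whose score dropped by at least `min_drop`."""
    regressions = []
    for i, (ref, old, new) in enumerate(zip(references, old_outputs, new_outputs)):
        old_score = metric(ref, old)
        new_score = metric(ref, new)
        if old_score - new_score >= min_drop:
            regressions.append(Regression(i, old_score, new_score))
    return regressions
```

The flagged examples are the natural candidates to inspect side by side before refining the prompt, which corresponds to step (4) in Figure 1.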