LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models

Minsuk Kahng, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Minsuk Chang, Michael Terry, Lucas Dixon

TL;DR

This work addresses the scalability and interpretability challenges of analyzing results from automatic side-by-side (AutoSxS) evaluation of LLMs. It introduces LLM Comparator, a visual analytics tool that pairs an interactive table with a visualization summary to expose slice-level performance, rater rationales, and text-based patterns found through n-grams and custom functions. The system integrates automated rationale summarization and cluster generation, helping practitioners understand when and why an upgraded system differs from its baseline and what qualitative differences emerge across prompts. Deployed within a large technology company, it has attracted hundreds of users and been used to analyze thousands of AutoSxS experiments, and an observational study with six practitioners indicates that it is useful for hypothesis generation and behavior analysis. The work contributes a concrete, scalable workflow for analyzing side-by-side evaluations and suggests directions for extending interpretability, including preconfigured patterns and LLM-based metric augmentation.

Abstract

Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at a large technology company. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.

Paper Structure (19 sections, 3 figures)

Figures (3)

  • Figure 1: Users can inspect the individual ratings to see the detailed rationales used by the raters.
  • Figure 2: The rationale clusters view presents a list of rationales that are frequently used by the automatic rater. Users can dynamically add new ones to compare how often relevant rationales occur for each of the two models.
  • Figure 3: Users can dynamically create functions that apply to responses. In this example, a function specified as a regular expression (i.e., "\n([*-])\s") checks whether each response contains a bulleted list; the results are displayed as purple chips.
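
The custom function described in the Figure 3 caption can be approximated outside the tool. The sketch below is illustrative only and is not the tool's implementation: it applies the caption's regular expression to two sets of responses and counts the matches for each model. The names `contains_bulleted_list` and `compare_matches`, and the sample responses, are hypothetical.

```python
import re

# Pattern from the Figure 3 caption: a newline, then "*" or "-", then whitespace.
BULLET_PATTERN = re.compile(r"\n([*-])\s")


def contains_bulleted_list(response: str) -> bool:
    """Return True if the response contains at least one bullet marker."""
    return BULLET_PATTERN.search(response) is not None


def compare_matches(responses_a, responses_b):
    """Count how many responses from each model match the pattern."""
    count_a = sum(contains_bulleted_list(r) for r in responses_a)
    count_b = sum(contains_bulleted_list(r) for r in responses_b)
    return count_a, count_b


# Hypothetical sample data, made up for illustration.
responses_a = [
    "Here are some options:\n* apples\n* pears",
    "A single paragraph answer.",
]
responses_b = ["Plain text only.", "Another plain answer."]

print(compare_matches(responses_a, responses_b))  # prints (1, 0)
```

Because the pattern requires a preceding newline, a response whose very first character is a bullet marker does not match unless a later bullet follows on a new line.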