LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
Minsuk Kahng, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Minsuk Chang, Michael Terry, Lucas Dixon
TL;DR
This work addresses the scalability and interpretability challenges of analyzing results from automatic side-by-side (AutoSxS) evaluation of LLMs by introducing LLM Comparator, a visual analytics tool that pairs an interactive table with a visualization summary. Together, these views expose slice-level performance, rater rationales, and text-based patterns surfaced through n-grams and custom functions. The system integrates automated rationale summarization and cluster generation, helping practitioners understand when and why a model performs better or worse than a baseline, and how the two models' responses differ qualitatively across prompts. Deployed within a large technology company, it has attracted hundreds of users and analyzed thousands of AutoSxS experiments, and an observational study with six practitioners demonstrates its usefulness for hypothesis generation and behavior analysis. The work contributes a concrete, scalable workflow for analyzing side-by-side evaluation results and suggests directions for extending interpretability, including preconfigured patterns and LLM-based metric augmentation.
Abstract
Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at a large technology company. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
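To make the "when" question concrete, the following is a minimal sketch of a slice-level aggregation over side-by-side results. It is not the tool's implementation. The record fields (`category`, `score`), the sign convention (positive scores favor the model under test), the tie threshold, and the choice to count a tie as half a win are all assumptions made for illustration.

```python
from collections import defaultdict


def win_rates_by_slice(records, threshold=0.25):
    """Compute a per-slice win rate for the model under test.

    Assumptions (hypothetical, not taken from the paper):
    - each record is a dict with a 'category' (the slice) and a 'score';
    - score > threshold means the model under test wins;
    - score < -threshold means the baseline wins; anything else is a tie;
    - a tie counts as half a win.
    """
    counts = defaultdict(lambda: {"wins": 0, "losses": 0, "ties": 0})
    for record in records:
        bucket = counts[record["category"]]
        if record["score"] > threshold:
            bucket["wins"] += 1
        elif record["score"] < -threshold:
            bucket["losses"] += 1
        else:
            bucket["ties"] += 1

    rates = {}
    for category, c in counts.items():
        total = c["wins"] + c["losses"] + c["ties"]
        rates[category] = (c["wins"] + 0.5 * c["ties"]) / total
    return rates


# Made-up example records, for illustration only.
example_records = [
    {"category": "coding", "score": 1.0},
    {"category": "coding", "score": -1.0},
    {"category": "math", "score": 0.0},
    {"category": "math", "score": 1.5},
]
example_rates = win_rates_by_slice(example_records)
```

A per-slice breakdown like this is what lets an analyst notice that an aggregate win rate hides categories where the model under test trails the baseline.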
