VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E Gonzalez
TL;DR
VibeCheck introduces a flexible framework to automatically discover and quantify qualitative 'vibes' that differentiate large language models and predict user preferences. By iteratively discovering axes, validating them with a panel of LLM judges, and refining through misclassified examples, the method yields well-defined, differentiating, and user-aligned vibes. Applied to real-world data (e.g., Llama-3-70b vs GPT-4) and tasks (summarization, math, captioning), vibes outperform predefined criteria in predicting model identity and user preferences, revealing actionable qualitative differences such as formatting, humor, and focus on ethics. The approach broadens model evaluation beyond correctness, with demonstrated utility in guiding model selection and informing design decisions; it also scales to multimodal domains and diverse tasks.
Abstract
Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize but struggle to quantify. These "vibes" (such as tone, formatting, or writing style) influence user preferences, yet traditional evaluations focus primarily on the single axis of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model (vibes) that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs and then uses a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found in human discovery, and we run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks, including summarization, math, and captioning, to provide insight into differences in model behavior. For example, VibeCheck finds that Command X prefers to add concrete intros and conclusions when summarizing compared to TNGL, that Llama-405b often overexplains its thought process on math problems compared to GPT-4o, and that GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash. The code and vibe visualizer are available at https://bench-mark.org/
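To make the three criteria (well-defined, differentiating, user-aligned) concrete, the sketch below scores a single vibe from a panel of judge votes. The abstract does not give the scoring formulas, so everything here is an illustrative assumption: the vote encoding (+1 if model A's output shows the vibe more, -1 if model B's does, 0 for a tie), the function names, and the simple averaging rules are hypothetical and may differ from the actual VibeCheck implementation.

```python
from statistics import mean

# Assumed encoding: each judge votes +1 if model A's output exhibits the vibe
# more strongly, -1 if model B's output does, and 0 if they look the same.

def panel_score(votes):
    """Average the votes of all judges for one prompt."""
    return mean(votes)

def judge_agreement(per_prompt_votes):
    """Proxy for 'well-defined': share of prompts where all judges agree."""
    unanimous = [len(set(votes)) == 1 for votes in per_prompt_votes]
    return sum(unanimous) / len(unanimous)

def separability(per_prompt_votes):
    """Proxy for 'differentiating': mean panel score across prompts.
    Values far from 0 mean one model shows the vibe consistently more."""
    return mean([panel_score(votes) for votes in per_prompt_votes])

def preference_agreement(per_prompt_votes, preferences):
    """Proxy for 'user-aligned': among prompts with a non-tied panel score,
    the share where the panel's sign matches the user's preferred model
    (+1 for model A, -1 for model B)."""
    hits, total = 0, 0
    for votes, pref in zip(per_prompt_votes, preferences):
        score = panel_score(votes)
        if score == 0:
            continue  # the panel sees no difference on this prompt
        total += 1
        hits += (score > 0) == (pref > 0)
    return hits / total if total else 0.0

# Hypothetical data: four prompts, three judges each, plus user preferences.
votes = [[1, 1, 0], [1, -1, 1], [-1, -1, -1], [1, 1, 1]]
prefs = [1, 1, 1, -1]

print("agreement:", judge_agreement(votes))
print("separability:", separability(votes))
print("preference agreement:", preference_agreement(votes, prefs))
```

In this toy setup, a vibe worth keeping would have high judge agreement, a separability score far from zero, and a preference agreement well away from chance (0.5).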
