VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E Gonzalez

TL;DR

VibeCheck introduces a flexible framework to automatically discover and quantify qualitative 'vibes' that differentiate large language models and predict user preferences. By iteratively discovering axes, validating them with a panel of LLM judges, and refining through misclassified examples, the method yields well-defined, differentiating, and user-aligned vibes. Applied to real-world data (e.g., Llama-3-70b vs GPT-4) and tasks (summarization, math, captioning), vibes outperform predefined criteria in predicting model identity and user preferences, revealing actionable qualitative differences such as formatting, humor, and focus on ethics. The approach broadens model evaluation beyond correctness, with demonstrated utility in guiding model selection and informing design decisions; it also scales to multimodal domains and diverse tasks.

Abstract

Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize but struggle to quantify. These "vibes" (such as tone, formatting, or writing style) influence user preferences, yet traditional evaluations focus primarily on the single axis of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model (vibes) that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs and then uses a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found in human discovery, and we run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks, including summarization, math, and captioning, to provide insight into differences in model behavior. The vibes VibeCheck discovers include the following: Command X prefers to add concrete intros and conclusions when summarizing compared to TNGL; Llama-405b often overexplains its thought process on math problems compared to GPT-4o; and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash. Code and a vibe visualizer are available at https://bench-mark.org/

Paper Structure

This paper contains 25 sections, 1 equation, 22 figures, 6 tables.

Figures (22)

  • Figure 1: Core components of VibeCheck. A vibe is an axis along which a pair of outputs differ: for example, in the top panel, output A is more friendly while output B is more formal, defining a friendliness vibe. To score a (prompt, output A, output B) triplet, a panel of LLM judges determines which output falls higher on the vibe, resulting in a score of 1 (A), -1 (B), or 0 (tie). Finally, the scores obtained over a large set of outputs, along with preference labels, are used to compute vibe utility.
  • Figure 2: Comparing Llama-3-70b vs. GPT-4 and Claude-3-Opus on Chatbot Arena. Negative separability scores indicate that Llama-3-70b aligns with the low (red) description, while negative preference coefficients show that alignment with low descriptions is preferred. We see that Llama is more humorous, uses more formatting, provides more examples, and comments much less on ethics than GPT and Claude: all attributes that correlate positively with human preference.
  • Figure 3: Comparing user preference and separability across STEM and writing tasks, on the predefined list of vibes referenced in Table \ref{tab:arena_vibe_results}. Negative preference coefficients indicate a preference for low-description vibes, while negative separability scores show that Llama responses align more with the low description than Claude or GPT responses. For writing tasks, detailed explanations, humor, and expressive emotion correlate positively with human preference, while these traits correlate negatively with preference on STEM tasks. Conversely, logical rigor has a stronger positive impact on preference for STEM tasks. These trends are reflected in the separability scores, with less separability on STEM tasks for vibes like humor and emotional tone, and more separability for logical rigor.
  • Figure 4: Top 5 vibes comparing GPT-4o to Llama-3-405B on MATH CoT. Negative separability scores indicate that GPT-4o aligns with the low (red) description, while negative preference coefficients show that alignment with low descriptions is preferred. GPT-4o outputs contain more LaTeX/MathML formatting, which is positively correlated with human preference, while Llama-3-405B has very structured and overly detailed responses, which is negatively correlated with preference.
  • Figure 5: Weaknesses in the mathematical abilities of the LLM judge (GPT-4o-mini).
  • ...and 17 more figures
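
The scoring step described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`panel_score`, `separability`, `preference_alignment`), the plurality-vote aggregation, and the use of a plain mean are all assumptions made for illustration. In particular, `preference_alignment` is only a rough stand-in for the preference coefficients mentioned in the captions.

```python
from collections import Counter
from statistics import mean


def panel_score(votes):
    """Aggregate one panel's votes ('A', 'B', or 'tie') into 1, -1, or 0.

    Assumption: plurality vote; any deadlock is scored as a tie (0).
    """
    counts = Counter(votes)
    a, b, t = counts.get("A", 0), counts.get("B", 0), counts.get("tie", 0)
    if a > b and a > t:
        return 1   # output A falls higher on the vibe
    if b > a and b > t:
        return -1  # output B falls higher on the vibe
    return 0


def separability(scores):
    """Hypothetical separability proxy: mean vibe score over all triplets."""
    return mean(scores) if scores else 0.0


def preference_alignment(scores, prefs):
    """Hypothetical proxy: mean of score * preference (1 = A preferred, -1 = B)."""
    products = [s * p for s, p in zip(scores, prefs)]
    return mean(products) if products else 0.0


# Three triplets, each judged by a panel of three (made-up votes).
panels = [["A", "A", "B"], ["B", "tie", "B"], ["A", "B", "tie"]]
scores = [panel_score(p) for p in panels]
print(scores, separability(scores))
```

Under these assumptions, a separability value near 1 or -1 means one model consistently sits at one end of the vibe, while a value near 0 means the vibe does little to distinguish the two models.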