QualEval: Qualitative Evaluation for Model Improvement

Vishvak Murahari, Ameet Deshpande, Peter Clark, Tanmay Rajpurohit, Ashish Sabharwal, Karthik Narasimhan, Ashwin Kalyan

TL;DR

QualEval addresses the inadequacy of single-scalar metrics for real-world AI tasks by introducing a qualitative evaluation framework that augments traditional metrics with a faithful, interpretable dashboard. It combines an evaluator LLM with a novel flexible LP solver to automatically discover and assign dataset attributes (domains and sub-tasks) and to generate human-readable, actionable insights for model improvement. The approach yields faithful priors, enables fine-grained proficiency analyses, and demonstrates practical gains, including up to 15% relative improvement on DialogSum via targeted data augmentation of a Llama 2 model. By framing evaluation as a data-scientist-in-a-box, QualEval accelerates model development and provides a general path toward more actionable, task-aware model diagnostics.

Abstract

Quantitative evaluation metrics have traditionally been pivotal in gauging the advancements of artificial intelligence systems, including large language models (LLMs). However, these metrics have inherent limitations. Given the intricate nature of real-world tasks, a single scalar to quantify and compare is insufficient to capture the fine-grained nuances of model behavior. Metrics serve only as a way to compare and benchmark models, and do not yield actionable diagnostics, thus making the model improvement process challenging. Model developers find themselves amid extensive manual efforts involving sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that, when applied, accelerate model improvement. The insights are backed by a comprehensive dashboard with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights, for example, improves the performance of the Llama 2 model by up to 15% (relative) on a challenging dialogue task (DialogSum) when compared to baselines. QualEval successfully increases the pace of model development, thus in essence serving as a data-scientist-in-a-box. Given the focus on critiquing and improving current evaluation metrics, our method serves as a refreshingly new technique for both model evaluation and improvement.

Paper Structure

This paper contains 37 sections, 3 equations, 18 figures, 1 table.

Figures (18)

  • Figure 1: QualEval goes beyond a single scalar metric and provides a dashboard that helps understand the model's performance in a fine-grained manner. QualEval's insights are faithful and lead to accelerated performance improvement when applied to the model. The dashboard visualizes the performance of the davinci-3 model on MBPP.
  • Figure 2: QualEval automatically discovers domains and sub-tasks from input data through an evaluator LLM, $\mathcal{E}$. QualEval then automatically assigns 2 domains and 2 sub-tasks to every sample in the dataset by solving a flexible linear program. Finally, QualEval generates a comprehensive dashboard and presents interpretable and actionable insights for practitioners.
  • Figure 3: Prior probabilities of domains and sub-tasks on the MBPP (top) and DialogSum (bottom) datasets
  • Figure 4: QualEval faithfully discovers and scores attributes. We compare the domain priors discovered by QualEval (right) with the ground truth domain annotations (left) in the MedMCQA dataset and find a high degree of alignment (e.g., "Pediatrics" -- $9\%$ vs $9\%$, "Obstetrics and Gynecology" -- $6\%$ vs $7\%$, and "Pharmacology" -- $6\%$ vs $6\%$).
  • Figure 5: Proficiency breakdown for different sub-tasks and domains in the MBPP and MMLU (clinical knowledge) datasets for davinci-3.
  • ...and 13 more figures
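
The prior probabilities (Figures 3 and 4) and proficiency breakdowns (Figure 5) mentioned in the captions above can be illustrated with a minimal sketch. It assumes that each sample has already been assigned exactly two domains, that a prior is the share of all assignments going to a domain, and that proficiency is the mean per-sample score over the samples assigned to a domain. These definitions, the function names, the toy data, and the "Anatomy" label are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict


def priors(assignments):
    """Share of all attribute assignments that goes to each attribute (assumed definition)."""
    counts = defaultdict(int)
    for attrs in assignments:
        for attr in attrs:
            counts[attr] += 1
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()}


def proficiency(assignments, scores):
    """Mean per-sample score over the samples assigned to each attribute (assumed definition)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for attrs, score in zip(assignments, scores):
        for attr in attrs:
            sums[attr] += score
            counts[attr] += 1
    return {attr: sums[attr] / counts[attr] for attr in sums}


# Hypothetical toy data: four samples, each assigned exactly two domains,
# with a per-sample score of 1.0 (correct) or 0.0 (incorrect).
assignments = [
    ("Pediatrics", "Pharmacology"),
    ("Pediatrics", "Anatomy"),
    ("Pharmacology", "Anatomy"),
    ("Pediatrics", "Pharmacology"),
]
scores = [1.0, 0.0, 1.0, 1.0]

domain_priors = priors(assignments)
domain_proficiency = proficiency(assignments, scores)

for domain in sorted(domain_priors):
    print(f"{domain}: prior={domain_priors[domain]:.3f}, "
          f"proficiency={domain_proficiency[domain]:.3f}")
```

Comparing the two dictionaries side by side is what makes the breakdown actionable in this sketch: a domain with a high prior but low proficiency would be a natural candidate for targeted data augmentation.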