Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet

TL;DR

The proposed automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales, opens a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.

Abstract

As models grow stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in a single instance. However, skill-wise performance is obscured when inspecting aggregate accuracy, which under-utilizes the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance by inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for $46$k instances over $12$ benchmarks, we observe that many skills are common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e., sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is $18\%$ more accurate in "computing molar mass", but $19\%$ less accurate in "applying constitutional law", despite the overall accuracies of the three models differing by a mere $0.4\%$. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill-slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a $3\%$ accuracy improvement over our $12$-dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
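The skill-slice accuracy and routing ideas described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `slice_accuracy` and `route`, the placeholder model names, the toy records, and the choice to score a model by its mean slice accuracy over an instance's skills. The paper's actual routing procedure may differ.

```python
from collections import defaultdict


def slice_accuracy(records):
    """Compute per-model accuracy on each skill-slice.

    records: iterable of (model, skills, correct) tuples, where `skills`
    is the list of skills annotated for one evaluation instance.
    Returns {model: {skill: accuracy}}.
    """
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for model, skills, correct in records:
        for skill in skills:
            totals[model][skill] += 1
            hits[model][skill] += int(correct)
    return {
        model: {skill: hits[model][skill] / totals[model][skill]
                for skill in totals[model]}
        for model in totals
    }


def route(skills, acc, default_model):
    """Pick the model with the highest mean slice accuracy over `skills`.

    Assumption: mean aggregation over the instance's skills. Skills with
    no recorded slice for a model are skipped; if no model has any
    matching slice, `default_model` is returned.
    """
    best_model, best_score = default_model, float("-inf")
    for model in sorted(acc):  # sorted for deterministic tie-breaking
        scores = [acc[model][s] for s in skills if s in acc[model]]
        if not scores:
            continue
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_model, best_score = model, mean
    return best_model


# Toy, made-up outcomes (not results from the paper).
records = [
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["applying constitutional law"], False),
    ("model_a", ["applying constitutional law"], True),
    ("model_b", ["computing molar mass"], False),
    ("model_b", ["computing molar mass"], True),
    ("model_b", ["applying constitutional law"], True),
    ("model_b", ["applying constitutional law"], True),
]
acc = slice_accuracy(records)
chosen = route(["applying constitutional law"], acc, default_model="model_a")
```

In this toy setup the two models have identical aggregate accuracy but opposite per-skill strengths, which is the kind of trade-off that aggregate accuracy hides and that routing on held-out instances would try to exploit.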

Paper Structure

This paper contains 25 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: We leverage model-generated rationales to extract the skills relevant to any evaluation instance. Inspecting accuracy along skill-slices (instances drawn across benchmarks involving the same skill) surfaces fine-grained insights otherwise obfuscated by aggregate accuracy.
  • Figure 2: (left) Sample GPT-4o generated rationale: each skill is listed under multiple names of cascaded granularity, and localized to a specific step and concluding claim. (right) Annotated skills can be verified independently with a second model or human. We include randomly sampled negative skills and multiple verifiers to assure the quality of the verification, as detailed in \ref{subsec:verifiers}.
  • Figure 3: (left) Post-hoc verification shows GPT-4o-annotated skills are relevant, and that automatic verifiers are reliable, as they admit low rates of false positives (marking a randomly sampled negative skill as relevant). (right) Rate at which GPT-4o-annotated skills are marked as relevant, separated by whether GPT-4o correctly answered the underlying evaluation instance (blue) or not (orange). Empirically, annotated skills have high relevancy rates even when the annotator answers the question incorrectly.
  • Figure 4: Skill-slices shed light on how models evolve over new releases. For GPT and Claude models, skills related to law see the largest increases in accuracy, while Gemini models improve most in math and science skills.
  • Figure 5: Unique strengths (top) and weaknesses (bottom) of GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, relative to one another. For each model, we present skills where the model's slice accuracy is highest / lowest (respectively) relative to the average of the other two model accuracies.
  • ...and 7 more figures