Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

Mazda Moayeri; Vidhisha Balachandran; Varun Chandrasekaran; Safoora Yousefi; Thomas Fel; Soheil Feizi; Besmira Nushi; Neel Joshi; Vibhav Vineet

Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet

TL;DR

The proposed automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales, opens a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.

Abstract

With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for $46$k instances over $12$ benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is $18\%$ more accurate in "computing molar mass", but $19\%$ less accurate in "applying constitutional law", despite the overall accuracies of the three models differing by a mere $0.4\%$. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a $3\%$ accuracy improvement over our $12$ dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.

Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

TL;DR

Abstract

k instances over

benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is

more accurate in "computing molar mass", but

less accurate in "applying constitutional law", despite the overall accuracies of the three models differing by a mere

. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a

accuracy improvement over our

dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.

Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

TL;DR

Abstract

Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (12)