Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models

Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh Goyal, Sanjeev Arora

TL;DR

Skill-Mix is a flexible evaluation framework that tests an AI model's ability to combine skills: the model must generate text on a given topic that demonstrates a randomly drawn subset of $k$ skills. Evaluation uses a two-stage generation-and-grading pipeline, with auto-grading by GPT-4 and LLaMA-2-70B-Chat plus human spot checks. The paper reports model performance and ablations, and examines whether models go beyond "stochastic parrot" behavior. The results reveal notable differences across model families, show signs of "cramming for the leaderboard", and support, via simple probability calculations, the claim that strong models produce skill combinations they did not see during training. The authors sketch an ecosystem of open, scalable Skill-Mix evaluations for the capabilities of future models.

Abstract

With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces Skill-Mix, a new evaluation to measure the ability to combine skills. Using a list of $N$ skills, the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of Skill-Mix to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training. We sketch how the methodology can lead to a Skill-Mix based ecosystem of open evaluations for AI capabilities of future models.
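The sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the placeholder skill and topic lists, and the prompt wording are assumptions made for illustration (only "modus ponens", "red herring", "metaphor", and the topic "sewing" appear in the paper's Figure 1 caption). The sketch draws a random $k$-skill subset with a topic and counts how many (subset, topic) combinations exist, which is what makes a memorized answer unlikely for even modest $k$.

```python
import math
import random


def sample_skill_mix(skills, topics, k, rng):
    """Draw one random subset of k distinct skills plus one topic.

    Hypothetical helper: the paper describes random k-subsets of a
    skill list; the exact sampling code is an assumption.
    """
    chosen_skills = rng.sample(skills, k)  # k distinct skills, uniform
    topic = rng.choice(topics)
    return chosen_skills, topic


def num_combinations(n_skills, n_topics, k):
    """Count distinct (k-skill subset, topic) pairs: C(N, k) * T."""
    return math.comb(n_skills, k) * n_topics


def build_prompt(chosen_skills, topic):
    """Assemble a simplified generation prompt (wording is illustrative)."""
    skill_list = ", ".join(chosen_skills)
    return (
        f"Write a short piece of text about {topic} "
        f"that illustrates all of these skills: {skill_list}."
    )


if __name__ == "__main__":
    # Placeholder lists; the real skill and topic lists are in the paper.
    skills = ["modus ponens", "red herring", "metaphor",
              "placeholder skill A", "placeholder skill B"]
    topics = ["sewing", "placeholder topic"]

    rng = random.Random(0)  # fixed seed so the draw is reproducible
    chosen, topic = sample_skill_mix(skills, topics, k=3, rng=rng)
    print(build_prompt(chosen, topic))

    # With hypothetical sizes N = 100 skills and T = 100 topics, the
    # number of combinations at k = 5 is already in the billions.
    print(num_combinations(100, 100, 5))
```

Since $\binom{N}{k} \approx N^k / k!$ when $k$ is small relative to $N$, the count grows roughly like $N^k$, matching the abstract's claim; the list sizes used above are illustrative, not the paper's.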

Paper Structure (37 sections, 2 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Left: Simplified depiction (with simplified prompt) of the generation stage of our evaluation. The generating model is given a topic (sewing) as well as skills (modus ponens, red herring, metaphor), and asked to generate text demonstrating the skills. The full prompt, which also contains skill definitions and examples, can be found in \ref{app:prompt_design_generation}. Right: Simplified depiction (with simplified prompt) of the grading stage of our evaluation. The grading model (not necessarily the same as the generating model) is given the generating model's output and grading instructions, and returns pointwise grading. The full grading prompt can be found in \ref{app:prompt_design_grading}.
  • Figure 2: Illustration of the skill-mix($k$) pipeline. In our experiments, we use $M=100$ for GPT-4 grading and $M=30$ for LLaMA-2 grading. For a more detailed illustration of grading a single piece of generated text, see Figure \ref{fig:grading_flowchart}.
  • Figure 3: Illustration of obtaining the aggregated grade. This illustration depicts the process used to grade a single generated piece of text.
  • Figure 4: Performance of various instruction-tuned student (generating) models on skill-mix($k$) graded by GPT-4. For the accompanying table, see Table \ref{tab:metrics-graded-by-gpt4}.
  • Figure 5: Performance of various instruction-tuned student (generating) models on skill-mix($k$) graded by GPT-4. Unlike in Table \ref{tab:metrics-graded-by-gpt4}, no point is awarded if a skill is explicitly mentioned in the text. For the accompanying table, see Table \ref{tab:metrics-graded-by-gpt4-filter}.
  • ...and 4 more figures