Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan

TL;DR

The paper introduces the human generalization function as a framework for modeling how human beliefs about LLM capabilities shape deployment decisions. It collects a dataset of 18,972 belief-update examples across 79 tasks and demonstrates that human generalizations are predictable using NLP models, with BERT performing best. It then evaluates how well LLMs align with these human generalizations, finding that when the cost of errors is high, larger models may underperform on the questions people select for deployment because they are misaligned with the human generalization function. This work highlights the need to evaluate LLMs on human deployment distributions and suggests directions to improve alignment and mitigate risks in high-stakes settings.

Abstract

What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that -- especially for cases where the cost of mistakes is high -- more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.
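The gap between benchmark performance and performance on the instances people choose to deploy a model on can be made concrete with a toy sketch. Everything below is an illustrative assumption rather than the paper's code or data: the `Question` fields, the belief values, and the rule that a question is deployed only when the human's believed probability of a correct answer meets a threshold (with a higher threshold standing in for a higher cost of mistakes).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    text: str
    human_belief: float  # hypothetical: believed probability the model answers correctly
    model_correct: bool  # hypothetical: whether the model actually answers correctly


def benchmark_accuracy(questions: List[Question]) -> float:
    """Accuracy over all questions, ignoring human beliefs."""
    return sum(q.model_correct for q in questions) / len(questions)


def deployed_accuracy(questions: List[Question], threshold: float) -> Optional[float]:
    """Accuracy over only the questions a human would deploy the model on.

    Assumed rule: deploy when human_belief >= threshold.
    Returns None when no question clears the threshold.
    """
    deployed = [q for q in questions if q.human_belief >= threshold]
    if not deployed:
        return None
    return sum(q.model_correct for q in deployed) / len(deployed)


# Made-up toy data: the human is most confident exactly where the model fails.
toy = [
    Question("q1", human_belief=0.9, model_correct=False),
    Question("q2", human_belief=0.8, model_correct=True),
    Question("q3", human_belief=0.4, model_correct=True),
    Question("q4", human_belief=0.3, model_correct=True),
]

print(benchmark_accuracy(toy))       # accuracy on all four questions
print(deployed_accuracy(toy, 0.7))   # accuracy on the two questions the human trusts
```

In this toy setup the model looks better on the full set than on the subset the human actually trusts it with, which is the kind of misalignment the abstract describes.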

Paper Structure (15 sections, 14 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Classically, ML models are deployed to perform tasks based on benchmark performance (left). When deployment is based on human generalization (right), a human decision maker first interacts with a model to assess its capabilities, and then the model is deployed to perform tasks the decision maker believes it will perform well on. The model's deployed performance depends on how well aligned its capabilities are with the human generalization function.
  • Figure 2: Qualitative examples of question pairs and predicted belief changes.
  • Figure 3: Left: The human generalization function is sparse. Most pairs of randomly sampled questions result in no belief change. Right: The bandit is effective at identifying instances where beliefs change. The x-axis ranks bandit predictions by likelihood of belief change, while the y-axis shows the fraction actually containing belief changes (using a held-out set). Over time, the bandit becomes effective at finding non-zero belief changes.
  • Figure 4: Examples of human generalization failures due to misalignment of Llama-2 (70B).
  • Figure 5: The distribution of prior beliefs, posterior beliefs, and the changes in beliefs for survey respondents after the first stage of survey collection.
  • ...and 3 more figures