Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function
Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan
TL;DR
The paper introduces the human generalization function, a framework for modeling how people's beliefs about an LLM's capabilities shape decisions about where to deploy it. It collects a dataset of 18,972 examples of human belief updates across 79 tasks and shows that human generalizations can be predicted with NLP models, with BERT performing best. It then evaluates how well LLMs align with these human generalizations and finds that, when the cost of errors is high, larger models can perform worse on the questions people choose to deploy them on, because they are misaligned with how people generalize. The work argues for evaluating LLMs on human deployment distributions and suggests directions for improving alignment and reducing risk in high-stakes settings.
Abstract
What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, by people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent, structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that -- especially for cases where the cost of mistakes is high -- more capable models (e.g., GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.
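
To make the last claim concrete, here is a minimal toy sketch in Python of how a model with higher overall accuracy can still do worse on the questions people choose to use it for. It is not the paper's method or data: the `Question` class, the belief values, the deployment threshold, and both example models are hypothetical, and the numbers are made up purely for illustration. The assumption is that a person deploys a model on a question only when their believed probability of success is at least some threshold, with a higher threshold standing in for a higher cost of mistakes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    name: str
    human_belief: float  # hypothetical: believed P(model is correct) after seeing other answers
    model_correct: bool  # whether the model actually answers this question correctly


def overall_accuracy(questions: List[Question]) -> float:
    """Plain benchmark accuracy over every question."""
    return sum(q.model_correct for q in questions) / len(questions)


def deployed_accuracy(questions: List[Question], threshold: float) -> Optional[float]:
    """Accuracy restricted to questions a person would choose to deploy on.

    A question is deployed when human_belief >= threshold.
    Returns None when no question clears the threshold.
    """
    deployed = [q for q in questions if q.human_belief >= threshold]
    if not deployed:
        return None
    return sum(q.model_correct for q in deployed) / len(deployed)


# Made-up toy data. People hold the same beliefs about both models,
# but only the first model succeeds where people expect it to.
aligned_model = [
    Question("q1", 0.9, True),
    Question("q2", 0.8, True),
    Question("q3", 0.3, False),
    Question("q4", 0.2, False),
]
misaligned_model = [
    Question("q1", 0.9, False),
    Question("q2", 0.8, True),
    Question("q3", 0.3, True),
    Question("q4", 0.2, True),
]

if __name__ == "__main__":
    for label, qs in [("aligned", aligned_model), ("misaligned", misaligned_model)]:
        print(label, overall_accuracy(qs), deployed_accuracy(qs, threshold=0.7))
```

In this toy setup the misaligned model is correct on more questions overall, yet it is correct less often on the two questions that clear the 0.7 threshold, which is the kind of gap between benchmark accuracy and deployed accuracy the abstract describes.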
