Scaling Laws Do Not Scale
Fernando Diaz, Michael Madaio
TL;DR
This work questions the universality of AI scaling laws by arguing that the performance gains attributed to larger datasets depend on the evaluation metrics chosen and on the values of the diverse communities the models affect. It draws on measurement theory and social science perspectives to highlight three problems that grow as data and deployments scale: metric fragility, increasing subpopulation diversity, and potentially conflicting value signals. The authors argue that reliance on single, universal metrics can obscure under-performance for many groups, and that scaling laws may not generalize across cultures or contexts, especially when metrics are unstable or misaligned with local values. They advocate interdisciplinary, participatory, and value-sensitive approaches that emphasize local, small-scale design and scrutinize the normative assumptions behind the push toward ever-larger data and models. The overarching message is a call to rethink claims of universality at scale and to develop evaluation practices that reflect the plurality of human values and contexts.
Abstract
Recent work has advocated for training AI models on ever-larger datasets, arguing that as the size of a dataset increases, the performance of a model trained on that dataset will correspondingly increase (a relationship referred to as "scaling laws"). In this paper, we draw on literature from the social sciences and machine learning to critically interrogate these claims. We argue that this scaling law relationship depends on the metrics used to measure performance, and that these metrics may not correspond with how different groups of people perceive the quality of models' output. As the datasets used to train large AI models grow and AI systems impact ever larger groups of people, the number of distinct communities represented in training or evaluation datasets grows as well. It thus becomes even more likely that communities represented in datasets have values or preferences that are not reflected in (or are at odds with) the metrics used to evaluate model performance in scaling laws. Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about the metrics used for model evaluations, which threatens the validity of claims that model performance is improving at scale. We end the paper with implications for AI development: the motivation for scraping ever-larger datasets may be based on fundamentally flawed assumptions about model performance. That is, models may not, in fact, continue to improve as datasets get larger, at least not for all people or communities impacted by those models. We suggest opportunities for the field to rethink norms and values in AI development: resisting claims for the universality of large models, fostering more local, small-scale designs, and finding other ways to resist the impetus towards scale in AI.
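As a rough illustration of the argument above, the following minimal sketch shows how a single population-weighted metric can rise with dataset size while the quality perceived by a smaller community falls. Everything in it is an assumption made for illustration: the group names, population shares, dataset sizes, and scores are hypothetical and are not results from the paper, and the `Group` class and `aggregate_score` function are names introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Group:
    """A hypothetical community with its own judgment of model quality."""
    name: str
    share: float          # fraction of the evaluated population
    scores: List[float]   # perceived quality at each dataset size


def aggregate_score(groups: List[Group], step: int) -> float:
    """Population-weighted average, standing in for a single 'universal' metric."""
    return sum(g.share * g.scores[step] for g in groups)


# Hypothetical numbers, chosen only to make the effect visible.
dataset_sizes = [10_000, 100_000, 1_000_000]
groups = [
    Group("majority", share=0.9, scores=[0.60, 0.70, 0.80]),
    Group("minority", share=0.1, scores=[0.60, 0.55, 0.50]),
]

aggregate_curve = [aggregate_score(groups, i) for i in range(len(dataset_sizes))]

for size, agg, minority in zip(dataset_sizes, aggregate_curve, groups[1].scores):
    print(f"n={size:>9,}  aggregate={agg:.3f}  minority={minority:.3f}")
```

In this toy setting the aggregate curve increases at every step, so a scaling-law fit to it would report steady improvement, even though the minority group's perceived quality decreases at every step. The sketch covers only the case where groups disagree about one shared score; it does not model communities that would choose different metrics altogether.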
