Data-Centric AI Governance: Addressing the Limitations of Model-Focused Policies

Ritwik Gupta; Leah Walker; Rodolfo Corona; Stephanie Fu; Suzanne Petryk; Janet Napolitano; Trevor Darrell; Andrew W. Reddie

TL;DR

This paper illustrates why dataset size and content are essential factors in assessing the risks posed by models, both today and in the future, and emphasizes the risk posed by reactive over-regulation.

Abstract

Current regulations on powerful AI capabilities are narrowly focused on "foundation" or "frontier" models. However, these terms are vague and inconsistently defined, leading to an unstable foundation for governance efforts. Critically, policy debates often fail to consider the data used with these models, despite the clear link between data and model performance. Even (relatively) "small" models that fall outside the typical definitions of foundation and frontier models can achieve equivalent outcomes when exposed to sufficiently specific datasets. In this work, we illustrate the importance of considering dataset size and content as essential factors in assessing the risks posed by models both today and in the future. More broadly, we emphasize the risk posed by over-regulating reactively and provide a path towards careful, quantitative evaluation of capabilities that can lead to a simplified regulatory environment.

Paper Structure (24 sections, 3 figures, 2 tables)

The Shortcomings of Today’s AI Governance
Definitional Challenges and Flawed Limits in AI Governance
An Unstable Definition Foundation
Capability and Model Size are not Strictly Correlated
A Misplaced Focus on FLOPs
Optimizations reverse trends.
Efficient methods develop rapidly.
Data is Missing from the Conversation
Big Data to Usable Information
Data-Centrism Opens new Analytic Frontiers
Retrieval
Retrieval from training data.
Retrieval from previously unseen data.
Derivation
Assumptions and Limitations
...and 9 more sections

Figures (3)

Figure 1: The effectiveness of a model isn't solely determined by its size or computational complexity. (Left) Despite PaliGemma having an order of magnitude more parameters than UniLSeg, it performs 9.4 mIoU points worse on the common RefCOCO (val) benchmark. (Right) Larger models do not necessarily perform better than smaller ones on the common MMLU benchmark.
Figure 2: Model size and FLOPs are insufficient determinants of capability. Pixelfly, a recent advancement in efficient model training, can maintain performance on ImageNet across many types of models while reducing their parameter counts and training FLOPs 68% and 200% on average, respectively. Each pair of dots represents a Mixer-S/B and ViT-S/B model and its Pixelfly variant.
Figure 3: Top-1 accuracy and GFLOPs for various models on the ImageNet-1K benchmark. The rapid pace of development of models results in better performance with fewer FLOPs.
