Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale

A. Feder Cooper

TL;DR

This work investigates reliability in machine learning at scale by linking rigorous measurement with law and policy. It develops a three-part framework: (1) identifying and mitigating arbitrariness in ML, (2) taming randomness in scalable uncertainty estimation and optimization, and (3) evaluating generative-AI systems with copyright and policy implications. A core contribution is the epistemic hyperparameter optimization (EHPO) framework, formalized with modal logic to guard against inconsistent conclusions from hyperparameter searches. It also presents a scalable, exact minibatch Metropolis-Hastings (MH) method (TunaMH) and a distributed example-ordering method (CD-GraB) to bridge reliability with practicality. In the generative-AI domain, the work measures extractable memorization in production LLMs, demonstrates training diffusion models on openly licensed Creative Commons (CC) data (CommonCanvas), and offers a supply-chain lens on copyright questions, underscoring the need for interdisciplinary tools to govern AI responsibly.

Abstract

To develop rigorous knowledge about ML models -- and the systems in which they are embedded -- we need reliable measurements. But reliable measurement is fundamentally challenging, and touches on issues of reproducibility, scalability, uncertainty quantification, epistemology, and more. This dissertation addresses criteria needed to take reliability seriously: both criteria for designing meaningful metrics, and for methodologies that ensure that we can dependably and efficiently measure these metrics at scale and in practice. In doing so, this dissertation articulates a research vision for a new field of scholarship at the intersection of machine learning, law, and policy. Within this frame, we cover topics that fit under three different themes: (1) quantifying and mitigating sources of arbitrariness in ML, (2) taming randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability, and (3) providing methods for evaluating generative-AI systems, with specific focuses on quantifying memorization in language models and training latent diffusion models on open-licensed data. By making contributions in these three themes, this dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately and inescapably bound up with research in law and policy. These different disciplines pose similar research questions about reliable measurement in machine learning. They are, in fact, two complementary sides of the same research vision, which, broadly construed, aims to construct machine-learning systems that cohere with broader societal values.

Paper Structure

This paper contains 373 sections, 19 theorems, 373 equations, 98 figures, 14 tables, and 13 algorithms.

Key Result

Theorem 1

Suppose that the set of allowable hyper-HPs $\mathcal{C}$ of $H$ is constrained, such that any two allowable random-search distributions $\mu$ and $\nu$ have Rényi-$\infty$-divergence at most a constant, i.e., $D_{\infty}(\mu \| \nu) \le \gamma$. The $(K,R)$-defended random-search EHPO of Definition

Figures (98)

  • Figure 1: Ph.D. projects organized by theme. Some projects do not fit neatly into these divisions [cooper2022arpa, laufer2023fouryears, cooper2021tecnologica], and many projects cross boundaries. Notably, Appendix \ref{chapter:accountability} [cooper2022accountability] touches on all three themes.
  • Figure 2: Running different sets of experiments for training the VGG-16 architecture to classify images in CIFAR-10. Both sets of experiments test SGD, Heavy Ball momentum, and Adam. The experiments on the right use one configuration for Adam, and the experiments on the left use another. In isolation, each set of experiments leads to its own conclusion; considered together, the two conclusions result in a logical contradiction.
  • Figure 3: 100 bootstrapped random forest models show models can be very consistent in predictions $\hat{y}$ for some individuals (Ind. 1) and arbitrary for others (Ind. 2). In this example, 50 models result in predictions that suggest Ind. 2 will recidivate (i.e., commit a crime again) and 50 that suggest they will not. Their prediction is arbitrary.
  • Figure 4: Training 101 bootstrapped random forest models on COMPAS 10 different times. Our estimates for self-consistency ($x$-axis) are very stable, as evidenced by the tightness of the error bars. In this setting, roughly 20% of classification decisions (indicated with the blue dotted line) in COMPAS are predictably and consistently arbitrary, resembling Individual 2 in Figure \ref{fig:intro:vote}.
  • Figure 5: Exact MCMC composes a proposal step (to produce new samples ${\bm{\theta}}'$) with an MH correction to remove bias by deciding to accept/reject the new sample as the next stage in the Markov chain (${\bm{\theta}}_{t+1}$). Our exact, scalable algorithms use 1) proposals that leverage stochastic gradients of the potential, $\tilde{\nabla}U$ [zhang2020amagold]; 2) MH corrections that use minibatches of data examples for computations with the potential, $\tilde{\Delta}U$ (Chapter \ref{chapter:tunamh}).
  • ...and 93 more figures
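
The contrast between Individuals 1 and 2 in the Figure 3 caption can be made concrete with a small worked example. The sketch below is an illustration only: it assumes that agreement among bootstrapped models is summarized as the fraction of model pairs that predict the same label, which may differ in detail from the self-consistency estimator defined in the dissertation. The function name `pairwise_agreement` and the vote lists are hypothetical; only the 50/50 split for Individual 2 comes from the caption.

```python
from itertools import combinations


def pairwise_agreement(predictions):
    """Fraction of unordered model pairs that predict the same label.

    `predictions` holds one label per bootstrapped model for a single
    individual. This is an illustrative agreement score, not necessarily
    the dissertation's exact self-consistency definition.
    """
    if len(predictions) < 2:
        raise ValueError("need predictions from at least two models")
    pairs = list(combinations(predictions, 2))
    agreeing = sum(1 for first, second in pairs if first == second)
    return agreeing / len(pairs)


# Hypothetical vote tallies from 100 models, mirroring the caption:
# Individual 1 receives the same label from every model (assumed 100-0),
# Individual 2 receives a 50-50 split.
individual_1 = [1] * 100
individual_2 = [1] * 50 + [0] * 50

print(pairwise_agreement(individual_1))
print(pairwise_agreement(individual_2))
```

Under this assumed score, a unanimous vote yields an agreement of 1, while an even split yields a value just under 0.5, the lowest an even number of binary votes can produce. A score near that floor means the individual's prediction depends mostly on which bootstrap sample the model happened to be trained on.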

Theorems & Definitions (64)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Theorem 1
  • Definition 8
  • Definition 9
  • ...and 54 more