Optimal Pricing for Data-Augmented AutoML Marketplaces

Minbiao Han; Jonathan Light; Steven Xia; Sainyam Galhotra; Raul Castro Fernandez; Haifeng Xu

TL;DR

This paper proposes a pragmatic data-augmented AutoML market that seamlessly integrates with existing cloud-based AutoML platforms such as Google's Vertex AI and Amazon's SageMaker and establishes an economically sustainable framework for monetizing external data.

Abstract

Organizations often lack sufficient data to effectively train machine learning (ML) models, while others possess valuable data that remains underutilized. Data markets promise to unlock substantial value by matching data suppliers with demand from ML consumers. However, market design involves addressing intricate challenges, including data pricing, fairness, robustness, and strategic behavior. In this paper, we propose a pragmatic data-augmented AutoML market that seamlessly integrates with existing cloud-based AutoML platforms such as Google's Vertex AI and Amazon's SageMaker. Unlike standard AutoML solutions, our design automatically augments buyer-submitted training data with valuable external datasets and prices the resulting models based on their measurable performance improvements rather than on computational cost, as is the status quo. Our key innovation is a pricing mechanism grounded in the instrumental value of externally sourced data, that is, the marginal improvement in model quality it provides. This approach bypasses the complexities of pricing datasets directly, mitigates strategic buyer behavior, and accommodates diverse buyer valuations through menu-based options. By integrating automated data and model discovery, our solution not only enhances ML outcomes but also establishes an economically sustainable framework for monetizing external data.

Paper Structure (22 sections, 8 theorems, 23 equations, 13 figures, 3 algorithms)

Introduction
Related Work
Data-Augmented AutoML Market Architecture
Finding the Optimal Pricing Mechanism
Relaxing Market's Prior Knowledge through Learning
Evaluation
An Implementation of the Market and Experimental Setup
RQ1: How effective is our market in finding high-quality augmentations and models?
RQ2: How do different pricing schemes perform?
RQ3: How to learn the prior distribution $\mu$?
Conclusion
Details on the Augmentation and Model Discovery Algorithm
Choosing Candidate Augmentation Throughout Iterations
Performance Results
Omitted Proofs from Section \ref{sec:economic}
...and 7 more sections

Key Result

Proposition 4.1

The optimal buyer policy can be computed via DP in $O(Q^2T)$ time.
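As a rough illustration of where an $O(Q^2T)$ bound can come from, the sketch below runs backward induction on a generic continue-or-stop Markov chain. It is not the paper's algorithm: the transition matrix `P`, the per-state payoff `stop_value`, and the per-step cost `step_cost` are hypothetical inputs chosen for illustration, with $Q$ read as the number of performance levels and $T$ as the number of time steps. Each of the $T$ steps updates $Q$ states, and each update sums over $Q$ successor states.

```python
from typing import List, Tuple


def optimal_stopping_policy(
    P: List[List[float]],      # assumed Q x Q transition matrix between performance levels
    stop_value: List[float],   # assumed buyer payoff for stopping in each state
    step_cost: float,          # assumed cost of continuing for one more step
    T: int,                    # number of time steps (horizon)
) -> Tuple[List[float], List[List[bool]]]:
    """Backward induction for a finite-horizon continue-or-stop problem.

    Returns the value at time 0 for every state and, for each time step,
    a list of flags saying whether stopping is optimal in that state.
    """
    Q = len(stop_value)
    V = list(stop_value)  # at the horizon the buyer has to stop
    policy: List[List[bool]] = []
    for _ in range(T):                      # T backward steps
        new_V = [0.0] * Q
        stop_now = [False] * Q
        for q in range(Q):                  # Q states per step
            # Expected value of continuing: Q successor states per state
            cont = -step_cost + sum(P[q][r] * V[r] for r in range(Q))
            if stop_value[q] >= cont:
                new_V[q] = stop_value[q]
                stop_now[q] = True
            else:
                new_V[q] = cont
        V = new_V
        policy.append(stop_now)
    policy.reverse()  # policy[t] now refers to time step t
    return V, policy


# Toy instance: state 0 always moves to state 1, and state 1 is absorbing.
P_toy = [[0.0, 1.0], [0.0, 1.0]]
values, policy = optimal_stopping_policy(P_toy, [0.0, 10.0], step_cost=1.0, T=3)
```

In the toy instance, continuing from state 0 costs 1 and reaches a payoff of 10, so the buyer continues in state 0 and stops in state 1. If the step cost is raised above 10, stopping immediately becomes optimal in both states.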

Figures (13)

Figure 1: A Markov chain model where the buyer makes a decision to continue or stop at every node.
Figure 2: Architecture of our market implementation. The discovery engine performs augmentation-model discovery to provide the sequence of performance metrics to the pricing engine.
Figure 3: Comparison of discovery techniques on 1K input tasks.
Figure 4: Profit Benchmark for School Data.
Figure 5: The underlying prior distribution $\mu^* \in \Delta^n$ is randomly generated. The set of performance metrics has size $|Q| = 10$, e.g., $Q = \{100\%, 99\%, \cdots, 91\%\}$; the number of buyer types is $n = 5$, the number of time steps is $T = 15$, and the learning rate is $\eta = 1/\sqrt{t}$. The $x$-axis shows the learning rounds, and the $y$-axis shows the KL divergence between the learned distribution and the underlying true distribution.
...and 8 more figures
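
The Figure 5 caption measures learning progress by the KL divergence between the learned and true distributions over buyer types. A minimal sketch of that metric follows. The direction $\mathrm{KL}(\mu^* \,\|\, \hat{\mu})$ and the clipping constant `eps` are assumptions made here, since the caption does not specify them, and the function name is hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions on the same support.

    Terms with p_i == 0 contribute nothing; q_i is clipped at eps
    (an assumed safeguard) to avoid division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy check with n = 2 buyer types (illustrative numbers only)
true_prior = [0.5, 0.5]
learned_prior = [0.25, 0.75]
gap = kl_divergence(true_prior, learned_prior)
```

The divergence is zero when the two distributions coincide and grows as the learned distribution moves away from the true one, so a curve that falls toward zero over the learning rounds indicates convergence.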

Theorems & Definitions (14)

Proposition 4.1
Theorem 4.2
Definition 5.1: Cost of Estimation Error (CEE)
Theorem 5.2: The Cost of Estimation Error
Proposition A.1
proof
Proposition B.1
proof
Lemma B.2
proof
...and 4 more
