Optimal Pricing for Data-Augmented AutoML Marketplaces

Minbiao Han, Jonathan Light, Steven Xia, Sainyam Galhotra, Raul Castro Fernandez, Haifeng Xu

TL;DR

This paper proposes a pragmatic data-augmented AutoML market that seamlessly integrates with existing cloud-based AutoML platforms such as Google's Vertex AI and Amazon's SageMaker and establishes an economically sustainable framework for monetizing external data.

Abstract

Organizations often lack sufficient data to effectively train machine learning (ML) models, while others possess valuable data that remains underutilized. Data markets promise to unlock substantial value by matching data suppliers with demand from ML consumers. However, market design involves addressing intricate challenges, including data pricing, fairness, robustness, and strategic behavior. In this paper, we propose a pragmatic data-augmented AutoML market that seamlessly integrates with existing cloud-based AutoML platforms such as Google's Vertex AI and Amazon's SageMaker. Unlike standard AutoML solutions, our design automatically augments buyer-submitted training data with valuable external datasets, pricing the resulting models based on their measurable performance improvements rather than on computational costs, as is the status quo. Our key innovation is a pricing mechanism grounded in the instrumental value of externally sourced data, that is, the marginal improvement in model quality it provides. This approach bypasses the complexities of pricing datasets directly, mitigates strategic buyer behavior, and accommodates diverse buyer valuations through menu-based options. By integrating automated data and model discovery, our solution not only enhances ML outcomes but also establishes an economically sustainable framework for monetizing external data.

Paper Structure (22 sections, 8 theorems, 23 equations, 13 figures, 3 algorithms)

Key Result

Proposition 4.1

The optimal buyer policy can be computed via DP in $O(Q^2T)$ time.
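
To illustrate where an $O(Q^2T)$ bound of this kind can come from, here is a minimal backward-induction sketch of a finite-horizon continue-or-stop problem. It is not the paper's algorithm: the transition matrix, the per-step stopping payoffs, and the flat `step_cost` are all hypothetical modeling choices, since the exact model is not reproduced on this page. With $Q$ quality levels and $T$ time steps, each of the $Q \cdot T$ states sums over $Q$ successors, which gives the $Q^2T$ count.

```python
from typing import List, Tuple


def optimal_stopping_policy(
    transition: List[List[float]],   # assumed: transition[q][q2] = P(level q -> level q2) in one step
    stop_payoff: List[List[float]],  # assumed: stop_payoff[t][q] = buyer's net utility if stopping at (t, q)
    step_cost: float,                # assumed: flat cost paid each time the buyer continues
) -> Tuple[List[List[float]], List[List[bool]]]:
    """Backward induction over (time step, quality level) states."""
    horizon = len(stop_payoff) - 1   # time steps are 0..horizon
    n = len(transition)              # number of quality levels, Q
    value = [[0.0] * n for _ in range(horizon + 1)]
    stop = [[True] * n for _ in range(horizon + 1)]

    # At the last step there is nothing left to explore, so the buyer stops.
    value[horizon] = list(stop_payoff[horizon])

    for t in range(horizon - 1, -1, -1):          # T iterations
        for q in range(n):                        # Q states per step
            # Expected value of continuing: Q successors per state.
            cont = -step_cost + sum(
                transition[q][q2] * value[t + 1][q2] for q2 in range(n)
            )
            if cont > stop_payoff[t][q]:
                value[t][q] = cont
                stop[t][q] = False
            else:
                value[t][q] = stop_payoff[t][q]
    return value, stop


# Toy instance: two quality levels, two time steps (0 and 1).
toy_transition = [[0.5, 0.5], [0.0, 1.0]]
toy_payoff = [[0.0, 1.0], [0.0, 1.0]]
toy_value, toy_stop = optimal_stopping_policy(toy_transition, toy_payoff, step_cost=0.1)
```

In the toy instance, a buyer in the low level at step 0 continues (a 50% chance of reaching the high level outweighs the step cost), while a buyer already in the high level stops.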

Figures (13)

  • Figure 1: A Markov chain model where the buyer makes a decision to continue or stop at every node.
  • Figure 2: Architecture of our market implementation. The discovery engine performs augmentation-model discovery to provide the sequence of performance metrics to the pricing engine.
  • Figure 3: Comparison of discovery techniques on 1K input tasks.
  • Figure 4: Profit Benchmark for School Data.
  • Figure 5: The underlying prior distribution $\mu^* \in \Delta^n$ is a randomly generated distribution. Size of performance metric $|Q|= 10$, e.g., $Q = \{100\%, 99\%, \cdots, 91\%\}$, number of buyer types is $n=5$, number of time steps is $T=15$, learning rate $\eta = 1/\sqrt{t}$. $x$-axis represents the learning rounds, and $y$-axis represents the KL divergence between the learned distribution and the underlying true distribution.
  • ...and 8 more figures

Theorems & Definitions (14)

  • Proposition 4.1
  • Theorem 4.2
  • Definition 5.1: Cost of Estimation Error (CEE)
  • Theorem 5.2: The Cost of Estimation Error
  • Proposition A.1
  • proof
  • Proposition B.1
  • proof
  • Lemma B.2
  • proof
  • ...and 4 more