How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets

Xiwen Huang, Pierre Pinson

TL;DR

The paper introduces active learning markets as a cost-efficient framework for purchasing labels under budget and improvement-threshold constraints in a linear regression setting. It formalizes the market with data pools including labelled, unlabelled, and missing-label data, and analyzes two pricing schemes, buyer-centric and seller-centric, within a single-buyer/multiple-seller setup. Using two active learning strategies, variance-based (VBAL) and query-by-committee-based (QBCAL), alongside a random-sampling baseline, the authors demonstrate improved data efficiency in real-world tasks involving real estate pricing and energy forecasting, supported by robustness analyses and statistical validation. The work highlights practical market properties such as budget balance and truthfulness, discusses seller revenue implications, and outlines future directions including streaming data, multi-party competition, and extensions to non-convex models. Overall, the proposed active learning market framework offers an adaptable approach to data acquisition that reaches comparable predictive performance with fewer purchased labels in resource-constrained environments.

Abstract

We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. As an original contribution, we formalise market clearing as an optimisation problem, which integrates budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance-based and query-by-committee-based), paired with distinct pricing mechanisms. They are compared to a benchmark random sampling approach. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal provides an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.
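
To make the idea of variance-based label purchasing under a budget and an improvement threshold more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes an ordinary least-squares model, fixed per-label prices posted by sellers, and a greedy rule that ranks candidates by the reduction in the trace of (X^T X)^{-1} per unit price. The function name `greedy_variance_purchase` and the parameters `prices`, `budget`, `min_gain` and `ridge` are hypothetical and chosen only for illustration.

```python
import numpy as np

def greedy_variance_purchase(X_owned, X_pool, prices, budget, min_gain=0.0, ridge=1e-8):
    """Greedy sketch: buy the labels that most reduce coefficient variance per unit price.

    Assumptions (not from the paper): OLS model, fixed posted prices, and
    trace((X^T X)^{-1}) as the variance proxy. Labels are not needed to rank
    candidates because this proxy depends on the features only.
    """
    d = X_owned.shape[1]
    # Inverse Gram matrix of the buyer's own data; the small ridge term keeps it invertible.
    A = np.linalg.inv(X_owned.T @ X_owned + ridge * np.eye(d))
    remaining = float(budget)
    available = list(range(len(X_pool)))
    bought = []
    while available:
        best, best_ratio = None, 0.0
        for i in available:
            if prices[i] > remaining:
                continue  # budget constraint
            x = X_pool[i]
            Ax = A @ x
            # Sherman-Morrison: drop in trace(A) if x were added to the design matrix.
            gain = (Ax @ Ax) / (1.0 + x @ Ax)
            if gain < min_gain:
                continue  # improvement threshold
            ratio = gain / prices[i]
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # nothing affordable clears the threshold
        x = X_pool[best]
        Ax = A @ x
        A = A - np.outer(Ax, Ax) / (1.0 + x @ Ax)  # rank-one update of the inverse
        remaining -= prices[best]
        available.remove(best)
        bought.append(best)
    return bought, float(budget) - remaining
```

With a budget of two units and unit prices, this sketch would pick the two pool points that lie farthest along poorly covered feature directions and skip a point that adds almost no information. A query-by-committee variant would replace the `gain` term with a disagreement score across models fitted on resampled data.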

Paper Structure

This paper contains 40 sections, 14 equations, 14 figures, 5 tables, 3 algorithms.

Figures (14)

  • Figure 1: Graphical representations of observation, feature and active learning markets, based on both design matrix and response vector (grey: data/features owned by the buyer; white: data/features the sellers may offer).
  • Figure 2: Overview of the active learning market. Random Sampling Corrected (RSC) serves as the baseline method, in which data points are selected randomly rather than through active learning and are purchased only if their labels yield positive model improvement. This baseline is omitted from the diagram for clarity.
  • Figure 3: Buyer-centric pricing approach (the starting point is not purchasing any data point)
  • Figure 4: Seller-centric pricing approach (the starting point is not purchasing any data point)
  • Figure 5: Analyst's side analysis on SC approach: variance reduction per unit cost. Results are represented as cumulative averages (i.e., as the average up to that data point purchased).
  • ...and 9 more figures