How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets
Xiwen Huang, Pierre Pinson
TL;DR
The paper introduces active learning markets as a cost-efficient framework for purchasing labels under budget and improvement-threshold constraints in a linear regression setting. It formalizes the market over data pools containing labelled, unlabelled, and missing-label data, and analyzes two pricing schemes, buyer-centric and seller-centric, in a single-buyer, multiple-seller setup. Using two active learning strategies, variance-based (VBAL) and query-by-committee-based (QBCAL), alongside a random-sampling baseline, the authors demonstrate improved data efficiency on real-world real estate pricing and energy forecasting tasks, supported by robustness analyses and statistical validation. The work examines market properties such as budget balance and truthfulness, discusses the implications for seller revenue, and outlines future directions including streaming data, multi-party competition, and extensions to non-convex models. The authors present the framework as a scalable, adaptable approach to data acquisition that reduces labelling costs while maintaining predictive performance in resource-constrained environments.
Abstract
We introduce and analyse active learning markets as a way to purchase labels in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This contrasts with the many existing proposals for purchasing features and examples. By formalising market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer, multiple-seller setup and propose the use of two active learning strategies (variance-based and query-by-committee-based), paired with distinct pricing mechanisms. They are compared to a benchmark random sampling approach. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, which consistently achieves superior performance with fewer labels acquired than conventional methods. Our proposal offers an easy-to-implement, practical solution for optimising data acquisition in resource-constrained environments.
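To make the idea of variance-based label purchasing under a budget concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear regression model, a fixed posted price per label, and a greedy rule that repeatedly buys the affordable label with the largest predictive-variance score x^T (X^T X)^{-1} x. The function names (`predictive_variance`, `buy_labels`), the `oracle` callback, the small ridge term, and the synthetic data are all hypothetical choices made for illustration; the improvement threshold and the paper's pricing mechanisms are not modelled here.

```python
import numpy as np


def predictive_variance(X_lab, X_pool, ridge=1e-6):
    """Score each pool point by x^T (X^T X + ridge*I)^{-1} x.

    Up to the noise variance, this is the variance of the linear
    regression prediction at x. The ridge term is an assumption added
    only to keep the matrix invertible.
    """
    d = X_lab.shape[1]
    A_inv = np.linalg.inv(X_lab.T @ X_lab + ridge * np.eye(d))
    return np.einsum("ij,jk,ik->i", X_pool, A_inv, X_pool)


def buy_labels(X_lab, y_lab, X_pool, oracle, prices, budget):
    """Greedily buy the highest-variance affordable label until none is affordable."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    available = np.ones(len(X_pool), dtype=bool)
    spent, bought = 0.0, []
    while True:
        affordable = available & (prices <= budget - spent)
        if not affordable.any():
            break
        scores = predictive_variance(X_lab, X_pool)
        scores[~affordable] = -np.inf  # never pick sold or unaffordable labels
        i = int(np.argmax(scores))
        # "Purchase" the label from the seller (hypothetical oracle callback).
        X_lab = np.vstack([X_lab, X_pool[i]])
        y_lab = np.append(y_lab, oracle(i))
        available[i] = False
        spent += prices[i]
        bought.append(i)
    # Refit ordinary least squares on the enlarged labelled set.
    beta, *_ = np.linalg.lstsq(X_lab, y_lab, rcond=None)
    return beta, bought, spent


# Synthetic example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
X_lab = rng.normal(size=(5, 3))
y_lab = X_lab @ beta_true + 0.1 * rng.normal(size=5)
X_pool = rng.normal(size=(50, 3))
y_pool = X_pool @ beta_true + 0.1 * rng.normal(size=50)
prices = np.full(50, 1.0)

beta, bought, spent = buy_labels(
    X_lab, y_lab, X_pool, lambda i: y_pool[i], prices, budget=10.0
)
```

With unit prices and a budget of 10, the sketch buys ten distinct labels and then stops. Replacing the score in `predictive_variance` with the disagreement among several models fitted on resampled data would give a query-by-committee variant of the same loop.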
