Data Collection and Labeling Techniques for Machine Learning
Qianyu Huang, Tongfang Zhao
TL;DR
This paper addresses data bottlenecks in deploying machine learning by surveying data collection and labeling techniques from both the ML and data management perspectives. It organizes data acquisition (discovery, augmentation, generation) and labeling (crowd-based and weak supervision) into workflows, and highlights how data management practices such as cleaning, standardization, and versioning can be integrated into them. Key contributions are a structured synthesis of current methods, an analysis of data quality challenges, and directions toward end-to-end, scalable pipelines, including semi-supervised learning and bias mitigation. The work emphasizes practical impact through design principles for integrated platforms and pipelines that support robust, scalable ML systems in real-world settings.
Abstract
Data collection and labeling are critical bottlenecks in the deployment of machine learning applications. As applications grow more complex and diverse, the need for efficient and scalable data collection and labeling techniques grows with them. This paper reviews state-of-the-art methods for data collection, data labeling, and the improvement of existing data and models. By integrating perspectives from both the machine learning and data management communities, we aim to provide a holistic view of the current landscape and identify future research directions.
