Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies
Qi Liu, Wanjing Ma
TL;DR
This work investigates how data corruption, including missing and noisy data, degrades performance in two distinct ML paradigms: NLP-SL and Signal-RL. It finds that performance follows an exponential-like diminishing-returns model, S = a (1 - e^{-\lambda (1 - p)}), where the parameter a depends on the corruption rate, and that Signal-RL is more sensitive to corruption than NLP-SL. The study also analyzes imputation trade-offs via imputation advantage heatmaps, showing that accurate imputation can help but noisy imputation can hurt, with boundary shapes differing by task. It further demonstrates that enlarging datasets provides only partial resilience to corruption: the amount of data required to offset quality losses grows roughly exponentially, and practical rules emerge for prioritizing data collection (e.g., roughly 30% of the data is critical for traffic signals). Collectively, these findings offer guidelines for robustness in preprocessing, imputation, and data acquisition across noisy ML applications, while outlining avenues for extending the work to additional domains and more sophisticated imputation techniques.
Abstract
Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects through two experimental setups: supervised learning with NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal optimization (Signal-RL). We analyze the relationship between data corruption levels and model performance, evaluate the effectiveness of data imputation methods, and assess the utility of enlarging datasets to address data corruption. Our results show that model performance under data corruption follows a diminishing-returns curve that is well modeled by an exponential function. Missing data, while detrimental, is less harmful than noisy data, which causes severe performance degradation and training instability, particularly in sequential decision-making tasks like Signal-RL. Imputation strategies involve a trade-off: they recover missing information but may introduce noise, so their effectiveness depends on both the imputation accuracy and the corruption ratio. We identify distinct regions in the imputation advantage heatmap, including an "imputation advantageous corner" and an "imputation disadvantageous edge", and classify tasks as "noise-sensitive" or "noise-insensitive" based on their decision boundaries. Furthermore, we find that increasing dataset size mitigates but cannot fully overcome the effects of data corruption, and that the marginal utility of additional data diminishes as corruption increases. An empirical rule emerges: approximately 30% of the data is critical for determining performance, while the remaining 70% has minimal impact. These findings provide actionable insights into data preprocessing, imputation strategies, and data collection practices, guiding the development of robust machine learning systems in noisy environments.
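The diminishing-returns behavior described above can be illustrated with a minimal Python sketch of the exponential form S = a (1 - e^{-\lambda (1 - p)}). It assumes that p denotes the corruption ratio, so that 1 - p is the fraction of clean data. The parameter values and the helper names score and gain_from_cleaning are hypothetical, chosen only for illustration, and are not fitted values or code from the study.

```python
import math

# Hypothetical parameters for illustration only; not fitted values from the study.
A = 1.0    # scale of the achievable score
LAM = 3.0  # rate parameter controlling how quickly the curve saturates


def score(p, a=A, lam=LAM):
    """Return S = a * (1 - exp(-lam * (1 - p))) for a corruption ratio p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return a * (1.0 - math.exp(-lam * (1.0 - p)))


def gain_from_cleaning(p_before, p_after, a=A, lam=LAM):
    """Score gained when the corruption ratio drops from p_before to p_after."""
    return score(p_after, a, lam) - score(p_before, a, lam)


# Score at several corruption ratios (assumed meaning of p).
for p in (0.9, 0.6, 0.3, 0.0):
    print(f"p = {p:.1f}  ->  S = {score(p):.3f}")

# The same 0.1 reduction in corruption helps far more when data is mostly corrupted.
print("gain 0.9 -> 0.8:", round(gain_from_cleaning(0.9, 0.8), 3))
print("gain 0.1 -> 0.0:", round(gain_from_cleaning(0.1, 0.0), 3))
```

Under these assumed parameters, the first increments of clean data account for most of the achievable score, which is the qualitative pattern behind the empirical rule that a minority of the data determines performance.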
