Improving Noise Robustness through Abstractions and its Impact on Machine Learning
Alfredo Ibias, Karol Capala, Varun Ravi Varma, Anna Drozdz, Jose Sousa
TL;DR
This work tackles the pervasive issue of noise in ML, including adversarial scenarios, by introducing data abstractions that map numeric features to discrete bins to dampen noise. The authors evaluate four classical ML methods (logistic regression, random forest, SVM, and ANN) on six binary classification datasets using abstractions generated via static binning, quantiles, ROC-based discretization, and K-means clustering, with a focus on an ANN as the primary learner. Key findings show that ROC-based abstractions and quantiles can improve noise robustness with only modest or negligible accuracy loss, and that abstractions generally reduce variance under noisy conditions, though dataset-specific guidance is essential. The study suggests practical benefits for robust ML in real-world noisy data settings and outlines future work to extend abstractions to other data types, treat adversarial attacks more directly, and optimize abstraction updating strategies.
Abstract
Noise is a fundamental problem in learning theory, with major effects on the application of Machine Learning (ML) methods because real-world data tends to be noisy. Additionally, the introduction of malicious noise can make ML methods fail critically, as is the case with adversarial attacks. Thus, finding and developing alternatives that improve robustness to noise is a fundamental problem in ML. In this paper, we propose a method to deal with noise: mitigating its effect through the use of data abstractions. The goal is to reduce the effect of noise on the model's performance through the loss of information produced by the abstraction. However, this information loss comes at a cost: it can reduce accuracy because of the missing information. First, we explored multiple methodologies for creating abstractions from the training dataset, for the specific case of numerical data and binary classification tasks. We also tested how these abstractions affect robustness to noise through several experiments that compare the robustness of an Artificial Neural Network trained on raw data versus one trained on abstracted data. The results clearly show that using abstractions is a viable approach for developing noise-robust ML methods.
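
To make the idea of an abstraction concrete, the following is a minimal sketch of the quantile variant only, written under our own assumptions rather than taken from the authors' implementation. The function names (`fit_quantile_edges`, `abstract`), the number of bins, and the synthetic Gaussian data and noise scale are all hypothetical choices made for illustration. Bin edges are fitted on training data and then reused to map new, possibly noisy, values to discrete bin indices.

```python
import numpy as np

def fit_quantile_edges(train_column, n_bins=4):
    """Return the interior bin edges given by the training column's quantiles."""
    # For n_bins=4 this selects the 25%, 50% and 75% quantiles.
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(train_column, qs)

def abstract(column, edges):
    """Map each numeric value to the index of the bin it falls into."""
    return np.digitize(column, edges)

# Hypothetical synthetic data: one numeric feature, standard normal.
rng = np.random.default_rng(0)
train = rng.normal(size=1000)
edges = fit_quantile_edges(train, n_bins=4)

# Perturb unseen values with small Gaussian noise (assumed scale 0.05).
clean = rng.normal(size=1000)
noisy = clean + rng.normal(scale=0.05, size=1000)

# Fraction of values whose abstracted representation is unaffected by the noise.
unchanged = np.mean(abstract(clean, edges) == abstract(noisy, edges))
print(f"bin assignments unchanged by noise: {unchanged:.1%}")
```

In this sketch, a perturbation changes the abstracted feature only when it pushes a value across a bin edge, which is the sense in which the information lost by the abstraction can dampen noise. The same loss is also why accuracy may drop: values inside one bin become indistinguishable to the learner.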
