Improving Noise Robustness through Abstractions and its Impact on Machine Learning

Alfredo Ibias, Karol Capala, Varun Ravi Varma, Anna Drozdz, Jose Sousa

TL;DR

This work tackles the pervasive issue of noise in ML, including adversarial scenarios, by introducing data abstractions that map numeric features to discrete bins to dampen noise. The authors evaluate four classical ML methods (logistic regression, random forest, SVM, and ANN) on six binary classification datasets using abstractions generated via static binning, quantiles, ROC-based discretization, and K-means clustering, with a focus on an ANN as the primary learner. Key findings show that ROC-based abstractions and quantiles can improve noise robustness with only modest or negligible accuracy loss, and that abstractions generally reduce variance under noisy conditions, though dataset-specific guidance is essential. The study suggests practical benefits for robust ML in real-world noisy data settings and outlines future work to extend abstractions to other data types, treat adversarial attacks more directly, and optimize abstraction updating strategies.

Abstract

Noise is a fundamental problem in learning theory, with large effects on the application of Machine Learning (ML) methods, because real-world data tends to be noisy. Additionally, the introduction of malicious noise can make ML methods fail critically, as is the case with adversarial attacks. Thus, finding and developing alternatives to improve robustness to noise is a fundamental problem in ML. In this paper, we propose a method to deal with noise: mitigating its effect through the use of data abstractions. The goal is to reduce the effect of noise on the model's performance through the loss of information produced by the abstraction. However, this information loss comes at a cost: it can reduce accuracy due to the missing information. First, we explored multiple methodologies to create abstractions, using the training dataset, for the specific case of numerical data and binary classification tasks. We also tested how these abstractions affect robustness to noise with several experiments that compare the robustness of an Artificial Neural Network to noise when trained on raw data vs. when trained on abstracted data. The results clearly show that using abstractions is a viable approach for developing noise-robust ML methods.
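To make the idea of an abstraction concrete, the following is a minimal sketch of a quantile-based abstraction for a single numerical feature. It is not the authors' implementation: the function names `fit_quantile_abstraction` and `apply_abstraction`, the choice of four bins, the synthetic Gaussian data, and the Gaussian noise model are all assumptions made for illustration. Bin boundaries are learned from the training data only, and every value is then replaced by the index of the bin it falls into, so a small perturbation changes the abstracted value only when it pushes the raw value across a boundary.

```python
import numpy as np


def fit_quantile_abstraction(train_column, n_bins=4):
    """Learn the interior bin boundaries from the training data only.

    Hypothetical helper: returns n_bins - 1 boundaries placed at
    evenly spaced quantiles of the training column.
    """
    probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(train_column, probs)


def apply_abstraction(column, boundaries):
    """Replace each numeric value by the index of its bin (0 .. n_bins - 1)."""
    return np.digitize(column, boundaries)


# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
train = rng.normal(size=1000)
test = rng.normal(size=1000)

boundaries = fit_quantile_abstraction(train, n_bins=4)

# Perturb the test data with additive Gaussian noise (assumed noise model).
noisy_test = test + rng.normal(scale=0.1, size=test.shape)

clean_bins = apply_abstraction(test, boundaries)
noisy_bins = apply_abstraction(noisy_test, boundaries)

# Fraction of test values whose abstracted representation is unaffected by noise.
unchanged = float(np.mean(clean_bins == noisy_bins))
print(f"Fraction of abstracted values unchanged by noise: {unchanged:.3f}")
```

In this sketch the learner would be trained on `apply_abstraction(train, boundaries)` rather than on `train`. The trade-off described in the abstract is visible directly: fewer bins absorb more noise but discard more information about the original values.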

Paper Structure (24 sections, 4 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Results of the first experiment per dataset. Horizontal lines mark the raw data results for comparison.
  • Figure 2: Mean of abstraction boundary variance for different noise strengths.
  • Figure 3: Variation of accuracy scores per noise scenario, for the different datasets, and per abstraction, with respect to Clean Training and Testing (scenario where we show the actual accuracy scores obtained).
  • Figure 4: Average accuracy (top) and standard deviation (bottom) between noise scenarios.

Theorems & Definitions (1)

  • Definition 1