Towards Modeling Data Quality and Machine Learning Model Performance

Usman Anjum, Chris Trentman, Elrod Caden, Justin Zhan

TL;DR

The paper addresses how data uncertainty and non-determinism affect ML model performance in trust-critical domains. It introduces the Deterministic-Non-Deterministic Ratio ($DDR$) as a data-quality metric and uses $DDR$-accuracy curves to quantify performance as data determinism varies. It develops a bottom-up synthetic data framework with $DDR$-invariant standardization and hit-and-run sampling to generate controlled datasets. It defines the trustworthiness portfolio pointwise as $p_{M,Y}=accuracy \times DDR$ and, as an aggregate measure, as the area under the $DDR$-accuracy curve, $p_M=\int_{0}^{1} accuracy(DDR)\, d(DDR)$. Experimental results across 10 models (5 regression, 5 classification) show that increasing $DDR$ generally improves accuracy. Some models, such as MLPs, are robust to uncertainty, while decision-tree and nearest-neighbor methods are more sensitive. The work advances a data-centric perspective for evaluating model performance under data-quality variations and lays groundwork for extensions to broader model classes, other kinds of data noise, and explainability under uncertainty.
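As a rough illustration of the aggregate measure $p_M$, the area under a $DDR$-accuracy curve can be approximated from a handful of measured points with the trapezoidal rule. The sketch below is not the authors' implementation: the function names `pointwise_portfolio` and `aggregate_portfolio` and the sample values are hypothetical, and it assumes accuracy has been measured at several $DDR$ values in $[0, 1]$.

```python
from typing import Sequence


def pointwise_portfolio(accuracy: float, ddr: float) -> float:
    """Pointwise portfolio p_{M,Y} = accuracy * DDR, as stated in the summary."""
    return accuracy * ddr


def aggregate_portfolio(ddr: Sequence[float], accuracy: Sequence[float]) -> float:
    """Approximate p_M, the area under the DDR-accuracy curve.

    Uses the trapezoidal rule on the sampled points (an assumption of this
    sketch; the source only states that p_M is the integral over [0, 1]).
    """
    if len(ddr) != len(accuracy):
        raise ValueError("ddr and accuracy must have the same length")
    if len(ddr) < 2:
        raise ValueError("need at least two points to integrate")
    # Sort by DDR so the points can be supplied in any order.
    points = sorted(zip(ddr, accuracy))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area


# Hypothetical measurements, for illustration only (not results from the paper).
ddr_values = [0.0, 0.25, 0.5, 0.75, 1.0]
accuracy_values = [0.10, 0.40, 0.65, 0.85, 0.95]
p_m = aggregate_portfolio(ddr_values, accuracy_values)
```

Under this reading, a model whose accuracy stays high at low $DDR$ accumulates a larger area and therefore a larger $p_M$, which is one way to compare robustness to uncertainty across models.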

Abstract

Understanding the effect of uncertainty and noise in data on machine learning models (MLMs) is crucial for developing trust and measuring performance. In this paper, a new model is proposed to quantify the effect of uncertainty and noise in data on MLMs. Building on the concept of signal-to-noise ratio (SNR), a new metric called the deterministic-non-deterministic ratio (DDR) is proposed to characterize the performance of a model. Using synthetic data in experiments, we show how accuracy changes with DDR and how DDR-accuracy curves can be used to assess the performance of a model.

Paper Structure

This paper contains 11 sections, 17 equations, 11 figures, 2 tables, 2 algorithms.

Figures (11)

  • Figure 1: NMSE-Based Accuracy of Ordinary Least Squares Regression vs. Matrix of Features DDR
  • Figure 2: NMSE-Based Accuracy of Decision Tree Regression vs. Matrix of Features DDR
  • Figure 3: NMSE-Based Accuracy of K-Nearest Neighbors Regression vs. Matrix of Features DDR
  • Figure 4: NMSE-Based Accuracy of Linear Support Vector Regression vs. Matrix of Features DDR
  • Figure 5: F1-Based Accuracy of Binary Logistic Regression 2-Class Classification vs. Matrix of Features DDR
  • ...and 6 more figures