Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

Manal Rahal, Bestoun S. Ahmed, Gergely Szabados, Torgny Fornstedt, Jorgen Samuelsson

TL;DR

The paper addresses poor data quality, which limits ML performance, by introducing a quality-centric data evaluation framework that combines domain-defined quality measurements with unsupervised clustering to identify quality-based data groups. It validates these groups by training a separate predictive model on each cluster to predict retention time $t_R$, showing that high-quality data (high signal-to-noise ratio $SNR$, low peak skewness, longer sequences, and stable $t_R$) yield better ML performance. The approach is explainable because it links cluster-level data characteristics to model accuracy, and it feeds results back to data source controllers to improve future data collection. The framework is generalizable and requires minimal human intervention, and it shows promise for reducing experimental cost and time while guiding data collection and quality control in chromatography and other domains.
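The pipeline summarized above (quality measurements, unsupervised grouping, then one model per group) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the data are synthetic, the three quality measurements (`snr`, `skew`, `dtr` for $\Delta t_R$) take invented ranges, a hand-written two-cluster k-means stands in for whichever clustering method the paper uses, and a straight-line fit of $t_R$ against sequence length stands in for the paper's predictive models.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Scale every quality measurement to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iters=20):
    """Plain k-means with k=2 and a deterministic farthest-point start."""
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        for j in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [mean(c) for c in zip(*members)]
    return labels


def fit_line_mae(xs, ys):
    """Least-squares line y = a + b*x; returns in-sample mean absolute error."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return mean(abs(y - (a + b * x)) for x, y in zip(xs, ys))


def make_synthetic(n=40, seed=42):
    """Hypothetical samples: clean and noisy measurements of the same trend."""
    rng = random.Random(seed)
    samples = []
    for quality in ("high", "low"):
        for _ in range(n):
            length = rng.randint(10, 20)
            if quality == "high":
                snr, skew = rng.uniform(80, 120), rng.uniform(0.9, 1.1)
                dtr, noise = rng.uniform(0.0, 0.05), 0.05
            else:
                snr, skew = rng.uniform(5, 20), rng.uniform(1.8, 2.2)
                dtr, noise = rng.uniform(0.4, 0.6), 1.0
            t_r = 2.0 + 0.5 * length + rng.gauss(0.0, noise)
            samples.append({"length": length, "snr": snr, "skew": skew,
                            "dtr": dtr, "t_r": t_r})
    return samples


# 1) quality measurements -> 2) clustering -> 3) one model per cluster
samples = make_synthetic()
features = standardize([[s["snr"], s["skew"], s["dtr"]] for s in samples])
labels = two_means(features)

report = {}
for j in (0, 1):
    group = [s for s, lab in zip(samples, labels) if lab == j]
    report[j] = {
        "n": len(group),
        "mean_snr": mean(s["snr"] for s in group),
        "mae": fit_line_mae([s["length"] for s in group],
                            [s["t_r"] for s in group]),
    }
    print(j, report[j])
```

The clustering step sees only the quality measurements, never $t_R$ itself, so any gap in per-cluster error reflects the quality grouping rather than leakage from the target. The error here is computed in-sample for brevity; a real evaluation would use a held-out split.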

Abstract

Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore, tedious and time-consuming work goes into data preparation and improvement before moving further in the ML pipeline. To address this challenge, we propose an intelligent data-centric evaluation framework that can identify high-quality data and improve the performance of an ML system. The proposed framework combines the curation of quality measurements and unsupervised learning to distinguish high- and low-quality data. The framework is designed to integrate flexible and general-purpose methods so that it can be deployed in various domains and applications. To validate the outcomes of the designed framework, we implemented it in a real-world use case from the field of analytical chemistry, where it is tested on three datasets of antisense oligonucleotides. A domain expert is consulted to identify the relevant quality measurements and evaluate the outcomes of the framework. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data that guide the conduct of efficient laboratory experiments and consequently improve the performance of the ML system.

Paper Structure

This paper contains 26 sections, 2 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Separation process using liquid chromatography and a spectrometry detector.
  • Figure 2: Proposed quality-centric data evaluation framework. The pipeline starts with raw data pre-processing, followed by quality measurements defined by domain experts ($Q_1$ to $Q_m$). The framework then applies unsupervised classification to generate quality-based clusters, which undergo ML model training and evaluation. Results feed back to data source controllers to enhance future data collection and quality.
  • Figure 3: Chromatographic signal where the noise and the peak height are labeled.
  • Figure 4: Symmetric peak, skewness = 1. Skewness is measured at x=0.5.
  • Figure 5: Distribution of quality measurements per ASO compound in G1 dataset: (a) $\Delta t_R$ showing retention time variation, (b) skewness indicating peak symmetry, and (c) SNR demonstrating signal strength. The range of values has been adjusted for visualization purposes.
  • ...and 5 more figures