DataLens: ML-Oriented Interactive Tabular Data Quality Dashboard

Mohamed Abdelaal, Samuel Lokadjaja, Arne Kreuz, Harald Schöning

TL;DR

DataLens tackles the data quality problem in ML pipelines by unifying automated profiling, ML-based and rule-based error detection, and repair in an interactive dashboard. It combines REST APIs for tooling, a user-in-the-loop module for labeling and rule validation, and an iterative cleaning module that uses Bayesian optimization via Optuna to optimize downstream performance metrics (e.g., minimizing $MSE$ for regression and maximizing $F1$ for classification). DataSheets, MLflow, and Delta Lake enable reproducibility and version control of data and experiments. Demonstrations on the NASA and Beers datasets show progressive improvements in downstream metrics as the number of search iterations increases, with performance approaching that achieved on ground-truth data. The work offers a scalable framework that tightly couples data cleaning with ML evaluation and provenance tracking.

Abstract

Maintaining high data quality is crucial for reliable data analysis and machine learning (ML). However, existing data quality management tools often lack automation, interactivity, and integration with ML workflows. This demonstration paper introduces DataLens, a novel interactive dashboard designed to streamline and automate the data quality management process for tabular data. DataLens integrates a suite of data profiling, error detection, and repair tools, including statistical, rule-based, and ML-based methods. It features a user-in-the-loop module for interactive rule validation, data labeling, and custom rule definition, enabling domain experts to guide the cleaning process. Furthermore, DataLens implements an iterative cleaning module that automatically selects optimal cleaning tools based on downstream ML model performance. To ensure reproducibility, DataLens generates DataSheets capturing essential metadata and integrates with MLflow and Delta Lake for experiment tracking and data version control. This demonstration showcases DataLens's capabilities in effectively identifying and correcting data errors, improving data quality for downstream tasks, and promoting reproducibility in data cleaning pipelines.
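
The idea behind the iterative cleaning module, choosing detection and repair tools by the downstream model score they lead to, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy data, the two detectors, the two repairers, the one-parameter downstream model, and the names `DETECTORS`, `REPAIRERS`, `evaluate`, and `search` are hypothetical and not taken from DataLens. Plain random search stands in for the Bayesian optimization that DataLens performs via Optuna, so that the example needs only the standard library.

```python
import random
import statistics

# Toy training data (hypothetical): the true relation is y = 2 * x.
# The feature column contains a missing value (None) and an outlier (500.0).
TRAIN_X = [1.0, 2.0, None, 4.0, 500.0, 6.0]
TRAIN_Y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
TEST_X = [1.0, 2.0, 3.0]
TEST_Y = [2.0, 4.0, 6.0]


def detect_missing(col):
    """Flag only missing cells."""
    return {i for i, v in enumerate(col) if v is None}


def detect_missing_and_outliers(col):
    """Flag missing cells plus values far from the median (MAD rule)."""
    present = [v for v in col if v is not None]
    med = statistics.median(present)
    mad = statistics.median([abs(v - med) for v in present])
    flagged = detect_missing(col)
    flagged |= {
        i for i, v in enumerate(col)
        if v is not None and abs(v - med) > 5 * max(mad, 1e-9)
    }
    return flagged


def _repair(col, flagged, aggregate):
    """Replace flagged or missing cells with an aggregate of the rest."""
    clean = [v for i, v in enumerate(col) if i not in flagged and v is not None]
    fill = aggregate(clean)
    return [fill if (i in flagged or v is None) else v for i, v in enumerate(col)]


DETECTORS = {"missing_only": detect_missing,
             "missing+outliers": detect_missing_and_outliers}
REPAIRERS = {"mean": lambda c, f: _repair(c, f, statistics.mean),
             "median": lambda c, f: _repair(c, f, statistics.median)}


def evaluate(detector_name, repairer_name):
    """Clean the feature, fit y = w * x by least squares, return test MSE."""
    flagged = DETECTORS[detector_name](TRAIN_X)
    x = REPAIRERS[repairer_name](TRAIN_X, flagged)
    w = sum(a * b for a, b in zip(x, TRAIN_Y)) / sum(a * a for a in x)
    return statistics.mean([(w * a - b) ** 2 for a, b in zip(TEST_X, TEST_Y)])


def search(n_trials, seed=0):
    """Random search over tool combinations (stand-in for Bayesian optimization)."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        config = (rng.choice(list(DETECTORS)), rng.choice(list(REPAIRERS)))
        history.append((evaluate(*config), config))
    return min(history), history


best, history = search(n_trials=10)
print("best MSE and tool combination:", best)
```

In this toy setting, combinations that also flag the outlier yield a far lower test MSE than those that only handle the missing value, which is the kind of signal such a search exploits. With Optuna, the loop in `search` would presumably be replaced by an objective function passed to a study, with the sampler rather than uniform random choice proposing each combination.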

Paper Structure

This paper contains 6 sections, 5 figures.

Figures (5)

  • Figure 1: Architecture of DataLens
  • Figure 2: Main window of DataLens
  • Figure 3: Evaluation of labeling ML-based tools
  • Figure 4: Distribution of detections across various attributes of the NASA dataset
  • Figure 5: Impact of the number of search iterations