DataLens: ML-Oriented Interactive Tabular Data Quality Dashboard
Mohamed Abdelaal, Samuel Lokadjaja, Arne Kreuz, Harald Schöning
TL;DR
DataLens tackles the data quality problem in ML pipelines by unifying automated profiling, ML-based and rule-based error detection, and repair in an interactive dashboard. It combines REST APIs for tooling, a user-in-the-loop module for labeling and rule validation, and an iterative cleaning module that uses Bayesian optimization via Optuna to optimize downstream performance metrics (e.g., minimizing $MSE$ for regression and maximizing $F1$ for classification). DataSheets, MLflow, and Delta Lake enable reproducibility and version control of data and experiments. Demonstrations on the NASA and Beers datasets show progressive improvements in downstream metrics as iterations increase, with performance approaching that achieved on ground-truth data. The work offers a scalable framework that tightly couples data cleaning with ML evaluation and provenance tracking.
Abstract
Maintaining high data quality is crucial for reliable data analysis and machine learning (ML). However, existing data quality management tools often lack automation, interactivity, and integration with ML workflows. This demonstration paper introduces DataLens, a novel interactive dashboard designed to streamline and automate the data quality management process for tabular data. DataLens integrates a suite of data profiling, error detection, and repair tools, including statistical, rule-based, and ML-based methods. It features a user-in-the-loop module for interactive rule validation, data labeling, and custom rule definition, enabling domain experts to guide the cleaning process. Furthermore, DataLens implements an iterative cleaning module that automatically selects optimal cleaning tools based on downstream ML model performance. To ensure reproducibility, DataLens generates DataSheets capturing essential metadata and integrates with MLflow and Delta Lake for experiment tracking and data version control. This demonstration showcases DataLens's capabilities in effectively identifying and correcting data errors, improving data quality for downstream tasks, and promoting reproducibility in data cleaning pipelines.
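The idea of selecting cleaning tools by downstream model performance can be illustrated with a minimal sketch. This is not the DataLens implementation: it swaps Bayesian optimization for a plain exhaustive search over a tiny search space, uses synthetic data, and all function names (`detect_iqr`, `repair_drop`, `search`, and so on) are hypothetical. Each detector/repair pair is scored by the MSE of a simple linear model evaluated on clean held-out data.

```python
import itertools
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (the stand-in downstream model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def mse(model, xs, ys):
    a, b = model
    return statistics.fmean([(a * x + b - y) ** 2 for x, y in zip(xs, ys)])

# Hypothetical detectors: each returns the indices of suspected erroneous values.
def detect_none(ys):
    return set()

def detect_iqr(ys):
    q1, _, q3 = statistics.quantiles(ys, n=4)
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return {i for i, y in enumerate(ys) if y < lo or y > hi}

# Hypothetical repairs: each returns a cleaned (xs, ys) pair.
def repair_drop(xs, ys, bad):
    keep = [i for i in range(len(ys)) if i not in bad]
    return [xs[i] for i in keep], [ys[i] for i in keep]

def repair_median(xs, ys, bad):
    median = statistics.median([y for i, y in enumerate(ys) if i not in bad])
    return list(xs), [median if i in bad else y for i, y in enumerate(ys)]

DETECTORS = {"none": detect_none, "iqr": detect_iqr}
REPAIRS = {"drop": repair_drop, "median": repair_median}

def search(train_x, train_y, test_x, test_y):
    """Score every detector/repair pair by downstream MSE; lower is better."""
    results = {}
    for d, r in itertools.product(DETECTORS, REPAIRS):
        bad = DETECTORS[d](train_y)
        xs, ys = REPAIRS[r](train_x, train_y, bad)
        results[(d, r)] = mse(fit_line(xs, ys), test_x, test_y)
    return min(results, key=results.get), results

# Synthetic data: y = 2x + 1 with two injected errors in the training labels.
train_x = list(range(20))
train_y = [2.0 * x + 1.0 for x in train_x]
train_y[3], train_y[15] = 500.0, -400.0
test_x = [0.5, 5.5, 10.5, 15.5]
test_y = [2.0 * x + 1.0 for x in test_x]

best, results = search(train_x, train_y, test_x, test_y)
print(best, results[best])
```

With a realistic number of tools and hyperparameters, exhaustive search becomes too expensive, which is where a sample-efficient optimizer such as the Bayesian optimization described above would replace the `itertools.product` loop.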
