In-Database Data Imputation

Massimo Perini; Milos Nikolic

TL;DR

This paper tackles missing data by bringing high-quality, model-based imputation into the database. It runs MICE with in-database learning of stochastic linear regression and Gaussian discriminant analysis models. It introduces a cofactor ring framework to compute training aggregates efficiently, enabling computation sharing and factorized evaluation that avoids materializing large joins. The proposed approach, implemented in PostgreSQL and DuckDB, achieves up to two orders of magnitude faster imputation than state-of-the-art external tools while maintaining competitive imputation quality. The work reduces end-to-end imputation time for large, multi-table datasets and provides open-source tooling for researchers and practitioners.

Abstract

Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates (e.g., mean), are computationally efficient but may introduce bias and disrupt variable relationships, leading to inaccurate analyses. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time, limiting their applicability to small datasets. This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method. We adapt this method to exploit computation sharing and a ring abstraction for faster model training. To impute both continuous and categorical values, we develop techniques for in-database learning of stochastic linear regression and Gaussian discriminant analysis models. Our MICE implementations in PostgreSQL and DuckDB outperform alternative MICE implementations and model-based imputation techniques by up to two orders of magnitude in terms of computation time, while maintaining high imputation quality.
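To make the ring abstraction mentioned above more concrete, here is a minimal sketch, not the authors' implementation. It assumes a cofactor element is the triple (count, sums, sums of pairwise products) over numeric columns, and that merging two partitions of the data is plain component-wise addition. The names `Cofactor`, `lift`, `aggregate`, and `fit_simple` are hypothetical, and the regression shown is ordinary least squares with one feature rather than the stochastic variant the paper develops.

```python
# Hypothetical sketch: all names here are illustrative, not taken from the paper.

class Cofactor:
    """Aggregate triple over numeric rows: count, per-column sums, pairwise product sums."""

    def __init__(self, count, sums, prods):
        self.count = count
        self.sums = sums
        self.prods = prods

    def __add__(self, other):
        # Ring addition: component-wise sum, so partitions can be merged cheaply.
        n = len(self.sums)
        return Cofactor(
            self.count + other.count,
            [self.sums[i] + other.sums[i] for i in range(n)],
            [[self.prods[i][j] + other.prods[i][j] for j in range(n)]
             for i in range(n)],
        )


def lift(row):
    """Map a single row to its cofactor triple."""
    return Cofactor(1, list(row), [[a * b for b in row] for a in row])


def aggregate(rows):
    """Sum the lifted rows; assumes at least one row."""
    it = iter(rows)
    total = lift(next(it))
    for row in it:
        total = total + lift(row)
    return total


def fit_simple(cof, x=0, y=1):
    """Least-squares fit y = w0 + w1 * x using only the aggregates."""
    n, sx, sy = cof.count, cof.sums[x], cof.sums[y]
    sxx, sxy = cof.prods[x][x], cof.prods[x][y]
    det = n * sxx - sx * sx  # zero only if x is constant
    w1 = (n * sxy - sx * sy) / det
    w0 = (sy - w1 * sx) / n
    return w0, w1


# Rows are (x, y) pairs drawn from y = 2x + 1, split into two partitions.
part_a = aggregate([(0.0, 1.0), (1.0, 3.0)])
part_b = aggregate([(2.0, 5.0), (3.0, 7.0)])
merged = part_a + part_b
w0, w1 = fit_simple(merged)
```

The model is fitted from `merged` without revisiting the rows, which is the property that lets aggregates be shared or adjusted across training rounds instead of recomputed from scratch.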

Paper Structure (23 sections, 12 equations, 28 figures, 2 algorithms)

Introduction
Background
The MICE Algorithm
In-Database Linear Regression
In-Database Imputation Methods
Stochastic Linear Regression
Database perspective.
Gaussian Discriminant Analysis
Setup.
Training.
Prediction.
Database Perspective.
MICE with Computation Sharing
Implementation
In-Database ML Implementation
...and 8 more sections

Figures (28)

Figure 1: Imputation quality and runtime of Python-based imputation methods on a flight dataset with 5M rows and 20% missing values. Imputation quality is measured as the root mean square error (RMSE) of the linear regression model trained over an imputed dataset to predict flight duration.
Figure 2: MICE with in-database ML
Figure 3: MICE with in-database ML and computation sharing
Figure 5: Flight dataset (cont. only)
Figure 6: Retailer dataset (cont. only)
...and 23 more figures

Theorems & Definitions (4)

Example 1
Example 2
Example 3
Example 4
