DPERC: Direct Parameter Estimation for Mixed Data
Tuan L. Vo, Quan Huu Do, Uyen Dang, Thu Nguyen, Pål Halvorsen, Michael A. Riegler, Binh T. Nguyen
TL;DR
DPERC addresses covariance estimation from incomplete mixed data by extending direct parameter estimation (DPER): categorical features serve as artificial class labels under an equal-covariance assumption, which enables more accurate estimation of the continuous-feature covariance $\boldsymbol{\Sigma}$. The method selects a suitable categorical feature via a theoretical criterion based on $\delta^{(g)} = d^{(g)}_{\mathbf{c}} + n_g \Delta^{(g)}$ and applies multi-class DPER to estimate the off-diagonal entries, handling both single-class and multi-class scenarios. Experiments on four real datasets show that DPERC improves estimation accuracy and produces correlation heatmaps that match the ground truth more closely than several imputation-based and direct-estimation baselines, with notable gains in multi-class settings. The approach offers a practical, computationally efficient route to direct covariance estimation with missing data and supports reliable correlation visualization in mixed-type datasets.
Abstract
The covariance matrix is fundamental to numerous statistical and machine-learning applications, such as Principal Component Analysis and correlation heatmaps. However, missing values in datasets are a serious obstacle to estimating this matrix accurately. Imputation methods offer one way to address this challenge, but they often entail a trade-off between computational efficiency and estimation accuracy. Consequently, attention has shifted towards direct parameter estimation, given its precision and reduced computational burden. In this paper, we propose Direct Parameter Estimation for Randomly Missing Data with Categorical Features (DPERC), an efficient approach for direct parameter estimation tailored to mixed data that contains missing values within continuous features. Our method leverages information from categorical features, which can significantly enhance covariance matrix estimation for the continuous features. Through comprehensive evaluations on diverse datasets, we demonstrate that DPERC performs competitively against various contemporary techniques. We also show experimentally that DPERC is a valuable tool for visualizing correlation heatmaps.
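To make the central idea concrete, the following is a minimal sketch of how a categorical feature can act as an artificial class label when estimating a shared covariance matrix from incomplete continuous data. It is not the DPER or DPERC estimator: it swaps in a simple pooled within-class estimate computed from pairwise-complete observations, purely for illustration. The function name `pooled_pairwise_cov`, its arguments, and the use of `NaN` to mark missing entries are assumptions made for this example.

```python
import numpy as np

def pooled_pairwise_cov(X, labels):
    """Pooled within-class covariance from pairwise-complete observations.

    X      : (n, p) array of continuous features, np.nan marks missing values.
    labels : (n,) array of categorical values used as artificial class labels.

    Illustrative stand-in only: a plain pooled estimate under an
    equal-covariance assumption, NOT the DPER/DPERC estimator.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    p = X.shape[1]
    num = np.zeros((p, p))  # summed within-class cross-products
    den = np.zeros((p, p))  # summed counts of jointly observed rows

    for g in np.unique(labels):
        Xg = X[labels == g]
        observed = ~np.isnan(Xg)
        for i in range(p):
            for j in range(i, p):
                both = observed[:, i] & observed[:, j]
                m = int(both.sum())
                if m < 2:
                    continue  # too few joint observations in this class
                xi, xj = Xg[both, i], Xg[both, j]
                # Center with the class-specific means of the jointly observed rows.
                s = np.sum((xi - xi.mean()) * (xj - xj.mean()))
                num[i, j] += s
                den[i, j] += m
                num[j, i] = num[i, j]
                den[j, i] = den[i, j]

    # Entries with no usable pairs come out as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return num / den
```

Centering within each class removes the class-specific means before pooling, so differences in mean between categories do not inflate the shared covariance estimate. Which categorical feature to use as the label is exactly the question the selection criterion in the paper addresses; this sketch takes the labels as given.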
