Precision Adaptive Imputation Network : An Unified Technique for Mixed Datasets
Harsh Joshi, Rajeshwari Mistri, Manasi Mali, Nachiket Kapure, Parul Kumari
TL;DR
PAIN addresses missing data in mixed-type datasets with a three-layer adaptive imputation framework that combines statistical methods, random forests, and autoencoders to reconstruct data under missing-at-random (MAR) and missing-not-at-random (MNAR) patterns. Adaptive weighting and a refinement stage help preserve the original data distributions. Evaluated under MAR-induced missingness on multiple UCI-like datasets, PAIN achieves better NRMSE, MAE, Sinkhorn divergence, MMD, and predictive variance preservation than traditional imputers and MissForest. It performs strongly on high-dimensional, correlated data, though at increased computational cost. The work contributes a unified hybrid imputation framework and a comprehensive evaluation paradigm for data reconstruction in mixed-type datasets.
Abstract
Missing data remains a significant obstacle across scientific domains and calls for imputation techniques that can handle complex missingness patterns. This study introduces the Precision Adaptive Imputation Network (PAIN), a novel algorithm that improves data reconstruction by adapting dynamically to diverse data types, distributions, and missingness mechanisms. PAIN uses a three-step process that integrates statistical methods, random forests, and autoencoders to balance imputation accuracy and efficiency. In evaluations across multiple datasets, including ones with high-dimensional and correlated features, PAIN consistently outperforms traditional methods such as mean and median imputation as well as advanced techniques such as MissForest. The findings highlight PAIN's ability to preserve data distributions and maintain analytical integrity, particularly in complex scenarios where missingness is not completely at random. Beyond deepening the understanding of missing data reconstruction, the work provides a framework for future methodological innovation in data science and machine learning and for more effective handling of mixed-type datasets in real-world applications.
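To make the evaluation setting concrete, the following is a minimal sketch of how an imputer can be scored on artificially hidden values. It is not PAIN itself: it uses plain mean imputation, one of the baselines named above, on a toy two-column dataset. The helper names (`mask_mar`, `mean_fill`, `nrmse`, `mae`), the threshold-based MAR rule, and the choice to normalize RMSE by the standard deviation of the hidden true values are all assumptions made for illustration; the paper may define its masking procedure and NRMSE differently.

```python
import math
import random


def mask_mar(rows, target, driver, threshold, p_missing, seed=0):
    """Hide `target` with probability `p_missing` whenever the fully observed
    `driver` column exceeds `threshold` (a simple, assumed MAR rule)."""
    rng = random.Random(seed)
    masked = [list(row) for row in rows]
    hidden = []
    for i, row in enumerate(rows):
        if row[driver] > threshold and rng.random() < p_missing:
            masked[i][target] = None
            hidden.append(i)
    return masked, hidden


def mean_fill(rows, col):
    """Return the mean of the observed (non-None) values in column `col`."""
    observed = [row[col] for row in rows if row[col] is not None]
    if not observed:
        raise ValueError("column has no observed values")
    return sum(observed) / len(observed)


def mae(true, pred):
    """Mean absolute error over the hidden entries."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def nrmse(true, pred):
    """RMSE divided by the standard deviation of the true values
    (one common normalization; assumed here)."""
    mse = sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)
    mean = sum(true) / len(true)
    var = sum((t - mean) ** 2 for t in true) / len(true)
    if var == 0:
        raise ValueError("true values have zero variance")
    return math.sqrt(mse / var)


# Toy data: column 1 depends linearly on column 0.
rows = [(x, 2 * x + 1) for x in range(10)]

# Hide column 1 wherever column 0 is above 4 (p_missing=1.0 keeps it deterministic).
masked, hidden = mask_mar(rows, target=1, driver=0, threshold=4, p_missing=1.0)

fill = mean_fill(masked, col=1)
truth = [rows[i][1] for i in hidden]
preds = [fill for _ in hidden]

print("hidden rows:", hidden)
print("MAE:", mae(truth, preds))
print("NRMSE:", nrmse(truth, preds))
```

Because the hidden values are systematically larger than the observed ones, the mean fill is biased low and both errors are large. This is the kind of failure under MAR missingness that a method exploiting correlations between features is meant to avoid.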
