An Interdisciplinary and Cross-Task Review on Missing Data Imputation
Jicong Fan
TL;DR
This work provides a comprehensive, cross-disciplinary synthesis of missing data imputation, tracing foundations from MCAR/MAR/MNAR and EM to modern deep learning, diffusion, GNNs, and large language models. It offers a detailed taxonomy spanning general imputation, special data formats, downstream learning, theory, benchmarks, and domain-specific challenges, emphasizing how imputation interacts with downstream tasks and privacy considerations. The paper highlights key gaps in benchmarking, model selection, and the need for universal, privacy-preserving, and domain-adaptive imputation methods, while outlining a roadmap toward robust, scalable, and interpretable solutions across diverse data types. By linking classical statistical principles with cutting-edge ML approaches, it aims to catalyze cross-domain innovation and enable reliable data analysis in the presence of missing values across science and industry.
Abstract
Missing data is a fundamental challenge in data science, significantly hindering analysis and decision-making across a wide range of disciplines, including healthcare, bioinformatics, social science, e-commerce, and industrial monitoring. Despite decades of research and numerous imputation methods, the literature remains fragmented across fields, creating a critical need for a comprehensive synthesis that connects statistical foundations with modern machine learning advances. This work systematically reviews core concepts, including missingness mechanisms, single versus multiple imputation, and different imputation goals, and examines problem characteristics across various domains. It provides a thorough categorization of imputation methods, ranging from classical techniques (e.g., regression, the EM algorithm) to modern approaches such as low-rank and high-rank matrix completion, deep learning models (autoencoders, GANs, diffusion models, graph neural networks), and large language models. Special attention is given to methods for complex data types, such as tensors, time series, streaming data, graph-structured data, categorical data, and multimodal data. Beyond methodology, we investigate the crucial integration of imputation with downstream tasks like classification, clustering, and anomaly detection, examining both sequential pipelines and joint optimization frameworks. The review also assesses theoretical guarantees, benchmarking resources, and evaluation metrics. Finally, we identify critical challenges and future directions, emphasizing model selection and hyperparameter optimization, the growing importance of privacy-preserving imputation via federated learning, and the pursuit of generalizable models that can adapt across domains and data types, thereby outlining a roadmap for future research.
