Developing robust methods to handle missing data in real-world applications effectively
Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal
TL;DR
This paper tackles the pervasive challenge of missing data across diverse modalities by proposing a comprehensive research agenda to develop robust methods that accommodate MCAR, MAR, and MNAR mechanisms. It outlines a multi-objective plan: evaluating existing methods through literature reviews and empirical studies, enhancing diffusion-based imputation with mask information to cover MAR/MNAR, extending approaches to categorical and heterogeneous data, and pursuing multimodal missingness with GNN-based representation learning. The work emphasizes scalability, cross-modal applicability, and practical impact for real-world datasets, including tabular, sensor, time-series, and multimodal contexts. By addressing these gaps, the research aims to provide actionable, robust strategies for imputing missing data across industries and data modalities.
Abstract
Missing data is a pervasive challenge spanning diverse data types, including tabular data, sensor data, time series, and images. Its origins are multifaceted, giving rise to various missingness mechanisms. Prior research in this field has predominantly assumed the Missing Completely At Random (MCAR) mechanism. However, the Missing At Random (MAR) and Missing Not At Random (MNAR) mechanisms, though equally prevalent, remain underexplored despite their significant influence. This PhD project presents a comprehensive research agenda designed to investigate the implications of diverse missing data mechanisms. The principal aim is to devise robust methodologies capable of effectively handling missing data while accommodating the unique characteristics of the MCAR, MAR, and MNAR mechanisms. By addressing these gaps, this research contributes to a richer understanding of the challenges posed by missing data across various industries and data modalities. It seeks to provide practical solutions for managing missing data effectively, enabling researchers and practitioners to use incomplete datasets with confidence.
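To make the distinction between the three mechanisms concrete, the following is a minimal sketch of how missingness masks could be simulated for a single column under the standard definitions: under MCAR the chance of a value being missing does not depend on the data, under MAR it depends only on observed values, and under MNAR it depends on the unobserved value itself. The function names (`make_masks`, `_drop_prob`), the logistic link, and the `rate` and `seed` parameters are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def _drop_prob(values, rate):
    """Hypothetical helper: per-entry drop probability that grows with the value."""
    z = (values - values.mean()) / (values.std() + 1e-12)  # standardize
    p = 1.0 / (1.0 + np.exp(-2.0 * z))                     # logistic link (assumption)
    # Rescale so the average drop probability is roughly `rate`.
    return np.clip(p * rate / p.mean(), 0.0, 1.0)


def make_masks(x_obs, x_target, rate=0.3, seed=0):
    """Return boolean masks (True = missing) for x_target under MCAR, MAR, MNAR.

    x_obs    : a column assumed to be fully observed
    x_target : the column in which values are to be removed
    """
    rng = np.random.default_rng(seed)
    n = x_target.shape[0]
    # MCAR: every entry is dropped with the same probability.
    mcar = rng.random(n) < rate
    # MAR: drop probability depends only on the fully observed column.
    mar = rng.random(n) < _drop_prob(x_obs, rate)
    # MNAR: drop probability depends on the value that goes missing.
    mnar = rng.random(n) < _drop_prob(x_target, rate)
    return mcar, mar, mnar


# Usage on synthetic data (illustrative only).
rng = np.random.default_rng(42)
x_obs = rng.normal(size=5000)
x_target = 0.5 * x_obs + rng.normal(size=5000)
mcar, mar, mnar = make_masks(x_obs, x_target)
x_mnar = np.where(mnar, np.nan, x_target)  # column with MNAR gaps
print("missing rates:", mcar.mean(), mar.mean(), mnar.mean())
```

In this sketch, the values removed under MNAR are systematically larger than those that remain, which is one reason a method that assumes MCAR can produce biased imputations when applied to such data.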
