Imputation-free Learning of Tabular Data with Missing Values using Incremental Feature Partitions in Transformer
Manar D. Samad, Kazi Fuad B. Akhter, Shourav B. Rabbani, Ibna Kowsar
TL;DR
The paper tackles the challenge of learning from tabular data with missing values without resorting to imputation. It introduces IFIAL, an imputation-free incremental attention learning framework that uses two attention masks to exclude missing values and incremental feature partitions to manage missing-rate heterogeneity. Across 17 diverse OpenML datasets and multiple missing-value types (MCAR, MNAR, natural), IFIAL with partition size $K=\frac{d}{2}$ consistently outperforms 11 imputation-based and imputation-free baselines in AUC, while also reducing computational overhead by avoiding imputation. The approach preserves data integrity, demonstrates robustness to high missing rates, and offers practical implications for healthcare and other data-rich domains where missing data are prevalent. Limitations include potential edge cases, such as very low missing rates or very large datasets, where imputation-based methods may be more efficient.
Abstract
Tabular data sets with varying missing values are prepared for machine learning using an arbitrary imputation strategy. Synthetic values generated by imputation models often raise concerns regarding data quality and the reliability of data-driven outcomes. To address these concerns, this article proposes an imputation-free incremental attention learning (IFIAL) method for tabular data with missing values. A pair of attention masks is derived and retrofitted to a transformer to directly streamline tabular data without imputing or initializing missing values. The proposed method incrementally learns partitions of overlapping and fixed-size feature sets to enhance the performance of the transformer. The average classification performance rank order across 17 diverse tabular data sets highlights the superiority of IFIAL over 11 state-of-the-art learning methods with or without missing value imputations. Additional experiments corroborate the robustness of IFIAL to varying types and proportions of missing data, demonstrating its superiority over methods that rely on explicit imputations. A feature partition size equal to one-half the original feature space yields the best trade-off between computational efficiency and predictive performance. IFIAL is one of the first solutions that enables deep attention models to learn directly from tabular data, eliminating the need to impute missing values. The source code for this paper is publicly available.
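The two ideas named in the abstract, masking missing values out of attention and learning over overlapping fixed-size feature partitions, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not define the two masks or how partitions are ordered, so the sketch assumes one plausible reading: an additive mask on keys (no token attends to a missing feature), a multiplicative mask on queries (tokens of missing features produce no output), and partitions taken as sliding windows over features sorted by ascending missing rate. The function names and the `stride` parameter are hypothetical.

```python
import numpy as np

def masked_attention(scores, missing):
    """Attention weights for one sample with missing features excluded.

    scores:  (d, d) raw attention scores between feature tokens.
    missing: (d,) boolean, True where the feature value is missing.
    Assumes at least one feature is observed.
    """
    # Assumed mask 1 (additive, on keys): missing features get -inf scores.
    key_mask = np.where(missing[None, :], -np.inf, 0.0)
    # Assumed mask 2 (multiplicative, on queries): missing rows are zeroed.
    query_mask = (~missing)[:, None].astype(float)

    z = scores + key_mask
    z = z - z.max(axis=1, keepdims=True)  # numerically stable softmax
    w = np.exp(z)
    w = w / w.sum(axis=1, keepdims=True)
    return w * query_mask

def feature_partitions(missing_rate, k, stride=1):
    """Overlapping, fixed-size partitions of feature indices.

    Features are sorted by ascending missing rate (an assumption), and
    windows of size k slide over that order with the given stride.
    """
    order = np.argsort(missing_rate, kind="stable")
    last_start = max(len(order) - k, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:  # make sure the final features are covered
        starts.append(last_start)
    return [order[s:s + k] for s in starts]

# Toy data: 3 samples, d = 4 features, NaN marks a missing value.
X = np.array([[1.0, np.nan, 3.0, 4.0],
              [np.nan, np.nan, 2.0, 1.0],
              [5.0, 6.0, np.nan, 2.0]])

weights = masked_attention(np.zeros((4, 4)), np.isnan(X[0]))
partitions = feature_partitions(np.isnan(X).mean(axis=0), k=X.shape[1] // 2)
```

With $K=\frac{d}{2}=2$ on the toy data, the partitions overlap by one feature each, and the attention weights for the first sample place no mass on the missing second feature. In a full model, each partition would be fed to the transformer in turn; how the incremental training step is scheduled is not specified in the abstract and is left out here.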
