Imputation of missing values in multi-view data

Wouter van Loon, Marjolein Fokkema, Frank de Vos, Marisa Koini, Reinhold Schmidt, Mark de Rooij

TL;DR

The paper addresses the imputation of missing values in multi-view data by introducing a meta-level imputation strategy that operates in the dimension-reduced space produced by StaPLR. Because imputation takes place in the reduced $n \times V$ space ($n$ observations, $V$ views) rather than in the full feature space, the method substantially lowers computational cost while preserving or improving predictive performance and view-selection quality. Across simulations and a real multi-view MRI dataset, meta-level imputation, especially meta-level missForest, consistently matches or exceeds feature-level approaches and makes computationally demanding methods such as predictive mean matching (PMM) usable in high-dimensional settings. The work demonstrates practical benefits for integrative biomedical analyses in which entire views are often missing, offering a scalable, interpretable framework for view selection and prediction.

Abstract

Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This may lead to very large quantities of missing data which, especially when combined with high dimensionality, can make the application of conditional imputation methods computationally infeasible. However, the multi-view structure could be leveraged to reduce the complexity and computational load of imputation. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets and a real data application. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.
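
To make the idea of imputing in the dimension-reduced space concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the per-view predictions have already been computed and collected in an $n \times V$ matrix `Z`, with `None` marking an observation whose view is missing. It then applies the simplest of the meta-level methods listed in the figure captions, meta-level mean imputation (mMI); missForest or PMM would take the place of the column-mean step. The function name and the example values are hypothetical.

```python
from statistics import fmean


def impute_meta_level_mean(Z):
    """Fill missing entries (None) of an n x V matrix Z with column means.

    Assumption: Z[i][v] holds the view-v base learner's prediction for
    observation i, or None when view v is missing for that observation.
    """
    n_views = len(Z[0])
    col_means = []
    for v in range(n_views):
        observed = [row[v] for row in Z if row[v] is not None]
        if not observed:
            raise ValueError(f"view {v} has no observed predictions")
        col_means.append(fmean(observed))
    # One imputed value per (observation, missing view) pair,
    # regardless of how many features the view contains.
    return [
        [col_means[v] if row[v] is None else row[v] for v in range(n_views)]
        for row in Z
    ]


# Hypothetical predictions for n = 4 observations and V = 3 views.
Z = [
    [0.25, None, 0.75],
    [0.75, 0.5, None],
    [None, 1.0, 0.25],
    [0.5, 0.0, 0.5],
]
Z_imputed = impute_meta_level_mean(Z)
```

The completed matrix `Z_imputed` would then be passed to the meta-learner. Only three values are imputed here, one per missing view, however large the underlying views are.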

Paper Structure

This paper contains 19 sections, 2 equations, 14 figures, 5 tables, and 1 algorithm.

Figures (14)

  • Figure 1: An example of how missing views can be handled through either classical feature concatenation (left), or through StaPLR (right). The blank areas in the data represent missing values. In this example, we assume there are three views, namely $\bm{X}^{(1)}$, $\bm{X}^{(2)}$, and $\bm{X}^{(3)}$. We denote the number of features in each view by $m_1$, $m_2$, and $m_3$, respectively. Assume that there are $l$ observations which have missing values on $\bm{X}^{(2)}$. In the case of feature concatenation, we would have to either impute $lm_2$ missing values, or entirely discard $l(m_1 + m_3)$ observed values. However, in the proposed missing data handling method for StaPLR, only $l$ values need to be imputed.
  • Figure 2: Test accuracy with 90% missingness. CDA = complete data analysis; CCA = complete case analysis; MI = feature-level mean imputation; MF = feature-level missForest; mMI = meta-level mean imputation; mMF = meta-level missForest; mPMM = meta-level predictive mean matching, generating $\bm{Z}$ once; cvPMM = meta-level predictive mean matching, generating $\bm{Z}$ five times; MOFA = multi-omics factor analysis imputation. $V_1$ is the smallest view, consisting of 5 features. $V_4$ is the largest view, consisting of 5000 features.
  • Figure 3: Mean squared error of probabilities (MSEP) with 90% missingness. CDA = complete data analysis; CCA = complete case analysis; MI = feature-level mean imputation; MF = feature-level missForest; mMI = meta-level mean imputation; mMF = meta-level missForest; mPMM = meta-level predictive mean matching, generating $\bm{Z}$ once; cvPMM = meta-level predictive mean matching, generating $\bm{Z}$ five times; MOFA = multi-omics factor analysis imputation. $V_1$ is the smallest view, consisting of 5 features. $V_4$ is the largest view, consisting of 5000 features.
  • Figure 4: Test accuracy with 50% missingness. CDA = complete data analysis; CCA = complete case analysis; MI = feature-level mean imputation; MF = feature-level missForest; mMI = meta-level mean imputation; mMF = meta-level missForest; mPMM = meta-level predictive mean matching, generating $\bm{Z}$ once; cvPMM = meta-level predictive mean matching, generating $\bm{Z}$ five times; MOFA = multi-omics factor analysis imputation. $V_1$ is the smallest view, consisting of 5 features. $V_4$ is the largest view, consisting of 5000 features.
  • Figure 5: Mean squared error of probabilities (MSEP) with 50% missingness. CDA = complete data analysis; CCA = complete case analysis; MI = feature-level mean imputation; MF = feature-level missForest; mMI = meta-level mean imputation; mMF = meta-level missForest; mPMM = meta-level predictive mean matching, generating $\bm{Z}$ once; cvPMM = meta-level predictive mean matching, generating $\bm{Z}$ five times; MOFA = multi-omics factor analysis imputation. $V_1$ is the smallest view, consisting of 5 features. $V_4$ is the largest view, consisting of 5000 features.
  • ...and 9 more figures
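
The counting argument in the Figure 1 caption can be checked with a few lines of arithmetic. The sketch below uses hypothetical sizes (the values of `l`, `m1`, `m2`, and `m3` are illustrative assumptions, not taken from the paper) and compares the three options described there: imputing at the feature level, discarding the affected observations, and imputing at the meta level.

```python
def feature_level_cells(l, m_missing):
    """Cells to impute when a view with m_missing features is absent for l observations."""
    return l * m_missing


def discarded_cells(l, m_observed):
    """Observed cells thrown away if the l incomplete observations are dropped instead."""
    return l * sum(m_observed)


def meta_level_cells(l):
    """Cells to impute in the reduced space: one value per affected observation."""
    return l


# Hypothetical sizes: three views, with view 2 missing for l observations.
l, m1, m2, m3 = 100, 5, 5000, 50

n_feature = feature_level_cells(l, m2)    # l * m_2
n_discard = discarded_cells(l, [m1, m3])  # l * (m_1 + m_3)
n_meta = meta_level_cells(l)              # l

print(n_feature, n_discard, n_meta)  # 500000 5500 100
```

Under these assumed sizes, meta-level imputation fills 100 values where feature-level imputation would fill 500,000, and no observed data are discarded.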