Representation Retrieval Learning for Heterogeneous Data Integration
Qi Xu, Annie Qu
TL;DR
This work tackles the challenge of integrating large-scale, heterogeneous data from multiple sources and modalities by introducing Representation Retrieval (R^2) learning, which builds a shared dictionary of representers and retrieves source-specific sparse representations to accommodate covariate shift and posterior drift. It extends to Blockwise Representation Retrieval (BR^2) to handle partial observations across modalities without imputation. A key innovation is the Selective Integration Penalty (SIP), which explicitly promotes more integrative representers, yielding improved theoretical excess risk bounds and empirical performance. Through extensive simulations and real-data analyses, the authors demonstrate that R^2 and BR^2 offer robust, imputation-free strategies for heterogeneous data integration with strong predictive accuracy and practical applicability.
Abstract
In the era of big data, large-scale, multi-source, multi-modality datasets are increasingly ubiquitous, offering unprecedented opportunities for predictive modeling and scientific discovery. However, these datasets often exhibit complex heterogeneity, such as covariate shift, posterior drift, and blockwise missingness, which degrades the predictive performance of existing supervised learning algorithms. To address these challenges simultaneously, we propose a novel Representation Retrieval (R2) framework, which integrates a dictionary of representation learning modules (the representer dictionary) with data source-specific, sparsity-induced machine learning models (learners). Under the R2 framework, we introduce the notion of integrativeness for each representer and propose a novel Selective Integration Penalty (SIP) that explicitly encourages more integrative representers in order to improve predictive performance. Theoretically, we show that the excess risk bound of the R2 framework is characterized by the integrativeness of the representers, and that SIP effectively improves the excess risk. Extensive simulation studies validate the superior performance of the R2 framework and the effect of SIP. We further apply our method to two real-world datasets to confirm its empirical success.
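
To make the structure described above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy setting in which each representer is a linear map followed by tanh, each source retrieves a fixed sparse subset of the shared dictionary, and each source has its own linear learner. All names (`dictionary`, `mask`, `learner_weights`, `predict`, `usage_count`) are hypothetical, the retrieval pattern is hard-coded rather than learned, and SIP itself is not implemented because the abstract does not state its form. The per-representer count of retrieving sources is only a rough proxy for how widely a representer is shared, not the paper's definition of integrativeness.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sources, n_features, n_representers, rep_dim = 3, 5, 4, 2

# Shared representer dictionary (assumption: each representer is a
# linear map followed by tanh; the paper may use other modules).
dictionary = [rng.normal(size=(n_features, rep_dim))
              for _ in range(n_representers)]

# Source-specific sparse retrieval pattern (assumption: fixed by hand here;
# in the framework it would be learned). Row s marks which representers
# source s uses.
mask = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
], dtype=bool)

# Source-specific learner weights, one block per representer.
learner_weights = rng.normal(size=(n_sources, n_representers, rep_dim))


def predict(x, source):
    """Predict for inputs x of shape (n, n_features) drawn from `source`,
    using only the representers that this source retrieves."""
    out = np.zeros(x.shape[0])
    for k, W in enumerate(dictionary):
        if mask[source, k]:
            out += np.tanh(x @ W) @ learner_weights[source, k]
    return out


# Hypothetical proxy: number of sources retrieving each representer.
usage_count = mask.sum(axis=0)

x = rng.normal(size=(7, n_features))
print(predict(x, source=0).shape)  # (7,)
print(usage_count)                 # [3 1 1 1]
```

In this toy configuration the first representer is retrieved by all three sources while the others are each used by a single source, which illustrates the kind of sharing that a penalty favoring integrative representers would encourage.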
