On the Power of Source Screening for Learning Shared Feature Extractors
Leo Wang, Connor Mclaughlin, Lili Su
TL;DR
This work shows that learning a shared linear subspace across many heterogeneous sources does not require pooling all data: training on a carefully screened, informative subpopulation can achieve minimax-optimal subspace estimation. The authors formalize admissible subpopulations via spectral conditions on the diversity matrix and provide both genie-aided and empirical algorithms to identify them. In a structured orthogonal-head setting, screening yields a subpopulation with uniform diversity eigenvalues and provable rate guarantees, and empirical results on synthetic and real data confirm gains over full-population baselines. The approach offers a principled way to mitigate negative transfer and improve generalization in multi-source representation learning.
Abstract
Learning with shared representations is widely recognized as an effective way to separate commonalities from heterogeneity across data sources. Most existing work includes all related data sources by simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we examine more closely which data sources should be learned jointly, focusing on collections of sources traditionally deemed "good", in which individual sources have similar relevance and quality with respect to the true underlying common structure. For tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of the data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.
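To make the idea of screening for uniform diversity eigenvalues concrete, the following is a minimal sketch and not the authors' algorithm. It assumes the diversity matrix is the average outer product of the source-specific heads, (1/M) Σ_i w_i w_iᵀ, a definition common in the shared-representation literature but not stated in this summary. The population sizes, the helper name `diversity_eigenvalues`, and the balancing rule are all hypothetical choices made for illustration.

```python
import numpy as np


def diversity_eigenvalues(heads):
    """Eigenvalues (descending) of the assumed diversity matrix (1/M) * sum_i w_i w_i^T.

    `heads` is an (M, k) array whose rows are the source-specific heads w_i.
    """
    heads = np.asarray(heads, dtype=float)
    diversity = heads.T @ heads / heads.shape[0]
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    return np.linalg.eigvalsh(diversity)[::-1]


# Hypothetical orthogonal-head population: each source's head is one of
# k = 3 orthonormal directions, but the directions are unevenly represented.
k = 3
basis = np.eye(k)
counts = [50, 10, 10]  # number of sources per direction (illustrative)
heads = np.vstack([np.tile(basis[j], (c, 1)) for j, c in enumerate(counts)])

full = diversity_eigenvalues(heads)

# Illustrative screening rule: keep the same number of sources per direction.
keep = min(counts)
subset = np.vstack([np.tile(basis[j], (keep, 1)) for j in range(k)])

screened = diversity_eigenvalues(subset)

print("full population:", full, "condition number:", full[0] / full[-1])
print("screened subset:", screened, "condition number:", screened[0] / screened[-1])
```

In this toy population the full diversity matrix has eigenvalues 5/7, 1/7, 1/7, while the balanced subset has all eigenvalues equal to 1/3, even though it keeps only 30 of the 70 sources. The sketch illustrates the spectral quantity only; it makes no claim about estimation rates, which depend on the paper's analysis.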
