On the Power of Source Screening for Learning Shared Feature Extractors

Leo Wang, Connor Mclaughlin, Lili Su

TL;DR

This work shows that learning a shared linear subspace across many heterogeneous sources need not pool all data; carefully screening for an informative subpopulation can achieve minimax-optimal subspace estimation. The authors formalize admissible subpopulations via spectral conditions on the diversity matrix and provide both genie-aided and empirical algorithms to identify them. In a structured orthogonal-head setting, screening achieves uniform diversity eigenvalues and provable rate guarantees, with empirical results on synthetic and real data confirming gains over full-population baselines. The approach offers a principled pathway to mitigate negative transfer and improve generalization in multi-source representation learning.
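To make the role of the diversity matrix concrete, the following is a minimal sketch, not the authors' code. It assumes the diversity matrix is the sample-size-weighted second moment of the source-specific heads, $D = \sum_i \frac{n_i}{N}\alpha_i\alpha_i^\top$; this summary does not spell out that definition, so treat it as an assumption. The helper names `diversity_matrix` and `smallest_eigenvalue` and the toy head configuration are hypothetical.

```python
import numpy as np

def diversity_matrix(heads, counts):
    """Weighted second moment of the heads: sum_i (n_i / N) * alpha_i alpha_i^T.

    ASSUMPTION: this weighting is one common definition; the summary above
    does not state the paper's exact formula.
    """
    heads = np.asarray(heads, dtype=float)    # shape (M, k), one head per source
    counts = np.asarray(counts, dtype=float)  # shape (M,), samples per source
    weights = counts / counts.sum()
    return (heads * weights[:, None]).T @ heads

def smallest_eigenvalue(D):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return float(np.linalg.eigvalsh(D)[0])

# Toy clustered population: 9 sources share head e1, 1 source has head e2.
heads = np.array([[1.0, 0.0]] * 9 + [[0.0, 1.0]])
counts = np.full(10, 100)

lam_full = smallest_eigenvalue(diversity_matrix(heads, counts))

# Hypothetical screened subset: one source from each cluster.
subset = [0, 9]
lam_screened = smallest_eigenvalue(diversity_matrix(heads[subset], counts[subset]))

print(f"full population:  lambda_k = {lam_full:.3f}")
print(f"screened subset:  lambda_k = {lam_screened:.3f}")
```

In this toy configuration the unbalanced full population has a small bottom eigenvalue, while the balanced subset has uniform eigenvalues despite using far fewer samples.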

Abstract

Learning with shared representation is widely recognized as an effective way to separate commonalities from heterogeneity across various heterogeneous sources. Most existing work includes all related data sources via simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we further dive into the question of which data sources should be learned jointly by focusing on the traditionally deemed "good" collection of sources, in which individual sources have similar relevance and qualities with respect to the true underlying common structure. Towards tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.

Paper Structure (24 sections, 9 theorems, 60 equations, 4 figures, 3 tables, 3 algorithms)

Key Result

Theorem 1

Consider a system with $M$ clients and $N$ data points in total. Assume $x_{ij}\sim N(0, I_d)$ and $\xi_{ij}\sim N(0, 1)$ independently for $i\in[M]$ and $j\in [n_i]$. Then for the model in Eq. (eq:model-sup), when $d\ge (1+\rho_1)k$ for a constant $\rho_1>0$, we have where $\sqrt{\frac{d}{N\lambda_k}} \wedge 1 = \min \{\sqrt{\frac{d}{N\lambda_k}}, 1\}$.
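As a quick numerical illustration of the quantity $\sqrt{\frac{d}{N\lambda_k}} \wedge 1$ appearing in the theorem, here is a minimal sketch. The function name `rate_term` is hypothetical, and $\lambda_k$ is simply passed in as a positive number because its definition is not reproduced in this summary.

```python
import math

def rate_term(d, N, lambda_k):
    """Return min(sqrt(d / (N * lambda_k)), 1), i.e. sqrt(d / (N * lambda_k)) ∧ 1.

    d        : ambient dimension
    N        : total number of data points
    lambda_k : positive scalar (assumed to be supplied by the caller)
    """
    if d <= 0 or N <= 0 or lambda_k <= 0:
        raise ValueError("d, N and lambda_k must all be positive")
    return min(math.sqrt(d / (N * lambda_k)), 1.0)

# With plenty of data the square-root term is active ...
print(rate_term(d=10, N=1000, lambda_k=1.0))
# ... while with too little data the term is capped at 1.
print(rate_term(d=100, N=10, lambda_k=0.5))
```

The cap at 1 reflects that the term cannot exceed 1 regardless of how small $N\lambda_k$ becomes.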

Figures (4)

  • Figure 1: Subspace reconstruction error ($\|\sin(B^*, \widehat{B})\|$) in a clustered setting. While pooling the full population maximizes sample size, uneven representation introduces bias. Conversely, a smaller balanced subset recovers the latent basis more effectively across all tested estimators. See the experiments section for setup and estimator details.
  • Figure 2: (Left) Performance on Clustered $\alpha_i$ setting. (Right) Performance on Heterogeneous Gaussian $\alpha_i$ setting.
  • Figure 3: (Left) Ablation over latent dimensionality $k$. (Right) Ablation over full dimensionality.
  • Figure 4: (Left) Ablation over clustered clients assignment proportion. (Right) Ablation over number of clients $M$.
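The reconstruction error reported in Figure 1 is a principal angle distance between the true and estimated subspaces. The sketch below shows one standard way to compute such a distance with NumPy; it assumes the norm is the spectral norm (the largest principal-angle sine), which this summary does not confirm, and the function names are hypothetical.

```python
import numpy as np

def orthonormal_basis(B):
    """Orthonormal basis for the column space of B via reduced QR."""
    Q, _ = np.linalg.qr(np.asarray(B, dtype=float))
    return Q

def sin_theta_distance(B_star, B_hat):
    """Largest sine of the principal angles between span(B_star) and span(B_hat).

    ASSUMPTION: spectral-norm version of the sin-theta distance.
    """
    U = orthonormal_basis(B_star)
    V = orthonormal_basis(B_hat)
    # Singular values of U^T V are the cosines of the principal angles.
    cosines = np.linalg.svd(U.T @ V, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against round-off above 1
    sines = np.sqrt(1.0 - cosines**2)
    return float(sines.max())

# Same subspace described by different bases: distance is numerically zero.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 2))
print(sin_theta_distance(B, B @ np.array([[2.0, 1.0], [0.0, 3.0]])))

# Mutually orthogonal subspaces: distance is one.
E = np.eye(4)
print(sin_theta_distance(E[:, :2], E[:, 2:]))
```

Because the distance depends only on the column spans, rescaling or re-mixing the columns of either basis leaves it unchanged.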

Theorems & Definitions (15)

  • Definition 1: Principal angle distance
  • Theorem 1: niu2024collaborative (informal)
  • Definition 2: Admissible subpopulation
  • Theorem 2: Minimax Statistical Optimal
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6: Weyl’s inequalities
  • proof
  • proof
  • ...and 5 more