On the Power of Source Screening for Learning Shared Feature Extractors

Leo; Wang; Connor Mclaughlin; Lili Su

TL;DR

This work shows that learning a shared linear subspace across many heterogeneous sources need not pool all data; carefully screening for an informative subpopulation can achieve minimax-optimal subspace estimation. The authors formalize admissible subpopulations via spectral conditions on the diversity matrix and provide both genie-aided and empirical algorithms to identify them. In a structured orthogonal-head setting, screening achieves uniform diversity eigenvalues and provable rate guarantees, with empirical results on synthetic and real data confirming gains over full-population baselines. The approach offers a principled pathway to mitigate negative transfer and improve generalization in multi-source representation learning.

Abstract

Learning with shared representations is widely recognized as an effective way to separate commonalities from heterogeneity across heterogeneous sources. Most existing work includes all related data sources by simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we dive further into the question of which data sources should be learned jointly by focusing on the traditionally deemed "good" collection of sources, in which individual sources have similar relevance and quality with respect to the true underlying common structure. For tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.
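To make the linear setting concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the model $y_{ij} = x_{ij}^\top B^* \alpha_i + \xi_{ij}$ with Gaussian covariates and noise, and it swaps in a plain method-of-moments estimator (top-$k$ eigenvectors of the average of $y^2 x x^\top$) as a stand-in for the estimators studied in the paper. The names `simulate_sources`, `mom_subspace`, and the `subset` argument are hypothetical.

```python
import numpy as np

def simulate_sources(d, k, n_per_source, alphas, rng):
    """Draw data from the ASSUMED model y = x^T B* alpha_i + noise."""
    # Random orthonormal basis B* (d x k) obtained from a reduced QR factorization.
    B_star, _ = np.linalg.qr(rng.standard_normal((d, k)))
    data = []
    for alpha, n in zip(alphas, n_per_source):
        X = rng.standard_normal((n, d))                    # x_ij ~ N(0, I_d)
        y = X @ (B_star @ alpha) + rng.standard_normal(n)  # xi_ij ~ N(0, 1)
        data.append((X, y))
    return B_star, data

def mom_subspace(data, k, subset=None):
    """Method-of-moments estimate of the shared subspace.

    Uses only the sources listed in `subset` (all sources if None), so the
    same routine serves full-population and screened training.
    """
    idx = range(len(data)) if subset is None else subset
    d = data[0][0].shape[1]
    S = np.zeros((d, d))
    N = 0
    for i in idx:
        X, y = data[i]
        S += (X * (y ** 2)[:, None]).T @ X   # sum_j y_j^2 x_j x_j^T
        N += len(y)
    S /= N
    _, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    return eigvecs[:, -k:]                   # top-k eigenvectors

# Illustrative usage with hypothetical sizes.
rng = np.random.default_rng(0)
d, k, M = 10, 2, 40
alphas = rng.standard_normal((M, k))         # source-specific heads
B_star, data = simulate_sources(d, k, [1000] * M, alphas, rng)
B_full = mom_subspace(data, k)                          # pool every source
B_sub = mom_subspace(data, k, subset=range(0, M, 2))    # keep half the sources
```

The `subset` argument is where a screening rule would plug in; here it simply keeps every other source, which is an arbitrary choice for illustration.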

Paper Structure (24 sections, 9 theorems, 60 equations, 4 figures, 3 tables, 3 algorithms)

Introduction
Related Work
Problem Setup
Main Results
Potentials of Source Screening
On the Fundamentals of Source Screening
Desired Sub-population and Existence
Algorithm in the Genie-Aided Selection
Empirical Subpopulation Search
Numerical Experiments
Synthetic Data
Real-world Data
Conclusion
Related Work on Client Selection in Federated Learning
Standard Assumptions and Statistical Rates of Existing Work
...and 9 more sections

Key Result

Theorem 1

Consider a system with $M$ clients and $N$ data points in total. Assume $x_{ij}\sim N(0, I_d)$ and $\xi_{ij}\sim N(0, 1)$ independently for $i\in[M]$ and $j\in [n_i]$. Then for the model in Eq. (eq:model-sup), when $d\ge (1+\rho_1)k$ for a constant $\rho_1>0$, we have a rate involving the quantity $\sqrt{\frac{d}{N\lambda_k}} \wedge 1 = \min \left\{\sqrt{\frac{d}{N\lambda_k}}, 1\right\}$.
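The quantity $\sqrt{d/(N\lambda_k)} \wedge 1$ can be evaluated numerically, as in the rough sketch below. It assumes that $\lambda_k$ is the $k$-th (smallest) eigenvalue of a diversity matrix of the form $\sum_i (n_i/N)\,\alpha_i\alpha_i^\top$; that definition is not given in this summary and is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def diversity_matrix(alphas, n_per_source):
    """ASSUMED form of the diversity matrix: sum_i (n_i / N) alpha_i alpha_i^T."""
    alphas = np.asarray(alphas, dtype=float)   # shape (M, k)
    n = np.asarray(n_per_source, dtype=float)  # shape (M,)
    weights = n / n.sum()
    return (alphas * weights[:, None]).T @ alphas

def rate_term(d, N, lambda_k):
    """Evaluate sqrt(d / (N * lambda_k)) ∧ 1, i.e. the minimum of the two."""
    if lambda_k <= 0:
        return 1.0                             # degenerate case: cap at 1
    return min(float(np.sqrt(d / (N * lambda_k))), 1.0)

# Toy clustered instance (hypothetical numbers): 90 sources share head e1,
# 10 sources share head e2, and each source holds 100 samples.
d, k = 20, 2
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
alphas_full = np.array([e1] * 90 + [e2] * 10)
n_full = [100] * 100

# Balanced subset: keep 10 sources from each cluster, discard the rest.
alphas_sub = np.array([e1] * 10 + [e2] * 10)
n_sub = [100] * 20

lam_full = np.linalg.eigvalsh(diversity_matrix(alphas_full, n_full))[0]
lam_sub = np.linalg.eigvalsh(diversity_matrix(alphas_sub, n_sub))[0]
term_full = rate_term(d, sum(n_full), lam_full)
term_sub = rate_term(d, sum(n_sub), lam_sub)
```

In this toy instance the product $N\lambda_k$ equals 1000 for both the full population and the balanced subset, so the term is the same even though the subset discards 80% of the data.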

Figures (4)

Figure 1: Subspace reconstruction error ($\|\sin(B^*, \widehat{B})\|$) in a clustered setting. While pooling the full population maximizes sample size, uneven representation introduces bias. Conversely, a smaller balanced subset recovers the latent basis more effectively across all tested estimators. See the Numerical Experiments section for setup and estimator details.
Figure 2: (Left) Performance on Clustered $\alpha_i$ setting. (Right) Performance on Heterogeneous Gaussian $\alpha_i$ setting.
Figure 3: (Left) Ablation over latent dimensionality $k$. (Right) Ablation over full dimensionality.
Figure 4: (Left) Ablation over clustered client assignment proportion. (Right) Ablation over number of clients $M$.
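The error reported in Figure 1 is a principal angle distance between the true and estimated subspaces. As a hedged sketch, the snippet below computes it in the spectral norm, that is, the sine of the largest principal angle; whether the paper uses the spectral or the Frobenius norm is not stated in this summary, so that choice is an assumption, and `sin_theta_distance` is a hypothetical name.

```python
import numpy as np

def sin_theta_distance(B_star, B_hat):
    """Sine of the largest principal angle between two column spaces.

    Both inputs are (d x k) matrices with full column rank; they are
    re-orthonormalized first, so any basis of each subspace works.
    """
    Q1, _ = np.linalg.qr(B_star)   # orthonormal basis of span(B_star)
    Q2, _ = np.linalg.qr(B_hat)    # orthonormal basis of span(B_hat)
    # Singular values of Q1^T Q2 are the cosines of the principal angles.
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)   # guard against round-off
    return float(np.sqrt(1.0 - cosines.min() ** 2))

# Illustrative usage: the distance depends only on the subspaces,
# so rescaling a basis leaves it unchanged.
same = sin_theta_distance(np.eye(3)[:, :2], 5.0 * np.eye(3)[:, :2])
orthogonal = sin_theta_distance(np.eye(3)[:, :1], np.eye(3)[:, 1:2])
```

Values near 0 mean the estimate nearly coincides with the true subspace, and a value of 1 means some direction of one subspace is orthogonal to the other.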

Theorems & Definitions (15)

Definition 1: Principal angle distance
Theorem 1: niu2024collaborative (informal)
Definition 2: Admissible subpopulation
Theorem 2: Minimax Statistical Optimal
Theorem 3
Theorem 4
Theorem 5
Theorem 6: Weyl’s inequalities
proof
proof
...and 5 more
