Collaborative Learning with Different Labeling Functions

Yuyang Deng; Mingda Qiao

TL;DR

This work studies learning $n$ classifiers, one for each of $n$ distributions that may have different labeling functions, with the goal of minimizing the total number of labeled samples while achieving $\epsilon$-accuracy on each distribution. It introduces $(k,\epsilon)$-realizability and a $(G,k)$-augmentation of the hypothesis class, which enables ERM-based learning through a VC-dimension bound and yields a near-optimal sample complexity of $O(kd\log(n/k) + n\log n)$ (treating $\epsilon$ and $\delta$ as constants, with $d$ the VC dimension of the hypothesis class). The paper proves that ERM over the augmented class is NP-hard for $k\ge3$, and gives computationally efficient learners for two special cases: identical marginals, and $2$-refutable hypothesis classes via approximate coloring, including a bipartite ($k=2$) scenario with favorable complexity. These results delineate when collaborative learning with heterogeneous labeling is statistically feasible and when computational barriers call for structure-based algorithms, with implications for federated, multi-task, and distributed learning.

Abstract

We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions, while minimizing the total number of samples drawn from them. Unlike in the usual collaborative learning setup, it is not assumed that there exists a single classifier that is simultaneously accurate for all distributions. We show that, when the data distributions satisfy a weaker realizability assumption, which appeared in [Crammer and Mansour, 2012] in the context of multi-task learning, sample-efficient learning is still feasible. We give a learning algorithm based on Empirical Risk Minimization (ERM) on a natural augmentation of the hypothesis class, and the analysis relies on an upper bound on the VC dimension of this augmented class. In terms of computational efficiency, we show that ERM on the augmented hypothesis class is NP-hard, which gives evidence against the existence of computationally efficient learners in general. On the positive side, for two special cases, we give learners that are both sample-efficient and computationally efficient.
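To make the idea of ERM on an augmented class concrete, here is a minimal Python sketch under stated assumptions. It assumes the augmented class can be read as "choose at most $k$ hypotheses from a finite class, plus an assignment of each distribution to one of them", and it ignores any role of $G$, which this page does not define. The function names `mistakes` and `brute_force_erm` and the toy threshold class are hypothetical and are not taken from the paper; this is an exhaustive search for illustration, not the paper's algorithm.

```python
from itertools import combinations


def mistakes(h, sample):
    """Number of labeled points (x, y) in `sample` that hypothesis h misclassifies."""
    return sum(1 for x, y in sample if h(x) != y)


def brute_force_erm(hypotheses, samples, k):
    """Exhaustive ERM over 'at most k hypotheses plus an assignment'.

    Hypothetical helper for illustration. `hypotheses` is a non-empty finite
    list of callables x -> {0, 1}; `samples[i]` is the labeled sample drawn
    from distribution i; k >= 1. Returns (total_mistakes, chosen_indices,
    assignment), where assignment[i] indexes the hypothesis for distribution i.
    """
    m, n = len(hypotheses), len(samples)
    # err[j][i] = mistakes of hypothesis j on the sample from distribution i.
    err = [[mistakes(h, s) for s in samples] for h in hypotheses]
    best = None
    # Extra hypotheses never hurt, so subsets of size exactly min(k, m) suffice.
    for chosen in combinations(range(m), min(k, m)):
        # With no further constraints, each distribution independently takes
        # its best hypothesis among the chosen ones.
        assignment = [min(chosen, key=lambda j: err[j][i]) for i in range(n)]
        total = sum(err[assignment[i]][i] for i in range(n))
        if best is None or total < best[0]:
            best = (total, chosen, assignment)
    return best


# Toy instance (assumed for illustration): thresholds on the integers 0..7,
# three distributions labeled by two distinct thresholds.
thresholds = list(range(8))
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]
points = list(range(8))
true_thresholds = [2, 5, 2]
samples = [[(x, int(x >= t)) for x in points] for t in true_thresholds]

total_k2, chosen_k2, assignment_k2 = brute_force_erm(hypotheses, samples, k=2)
total_k1, _, _ = brute_force_erm(hypotheses, samples, k=1)
print(total_k2, chosen_k2, assignment_k2, total_k1)
```

On this toy instance two hypotheses fit all three samples exactly, while any single hypothesis must err on some sample, which is the gap that $(k,\epsilon)$-realizability is meant to capture. The search visits every size-$k$ subset of the class, so its cost grows exponentially in $k$; that is in keeping with the NP-hardness result stated above, although the sketch itself proves nothing.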

Paper Structure (40 sections, 19 theorems, 48 equations, 3 algorithms)

Introduction
Problem Setup
Our Results
A sufficient condition for sample-efficient learning.
A sample complexity lower bound.
Intractability of ERM and proper learning.
Efficient algorithms for special cases.
Related Work
Collaborative learning.
Mixture learning from batches.
Computational hardness of learning.
Approximate coloring.
Discussion on Open Problems
Tighter sample complexity bounds.
A stronger hardness result.
...and 25 more sections

Key Result

Theorem 1

Suppose that $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_n$ are $(k, \epsilon)$-realizable with respect to hypothesis class $\mathcal{F}$. For any $\delta > 0$, there is an $(8\epsilon, \delta)$-PAC algorithm with sample complexity

Theorems & Definitions (53)

Definition 1: $(k, \epsilon)$-Realizability
Theorem 1
Definition 2: $(G, k)$-Augmentation
Theorem 2
Definition 3: ERM Oracle
Definition 4: Regular Hypothesis Family
Remark 5
Remark 6
Theorem 3
Theorem 4
...and 43 more
