Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions

Kasra Borazjani, Payam Abdisarabshali, Naji Khosravan, Seyyedali Hosseinalipour

TL;DR

This paper shows that evaluating FL under non-IID data using label skew alone is insufficient for computer vision tasks beyond classification. It introduces embedding-based data heterogeneity: data embeddings extracted from the penultimate layer of a task-trained network are clustered and distributed to clients using the Dirichlet distribution, yielding a task-aware benchmark of heterogeneity. Across seven Taskonomy tasks and additional datasets, embedding-based splits produce substantially larger loss increases under FL methods than traditional label-based splits, revealing more realistic degradation and the limitations of prior benchmarks. The work also provides a framework for assessing task similarity via embedding clusters and discusses implications for single- and multi-task FL, along with future directions such as diverse embedding generators and privacy-preserving embedding computation.

Abstract

Federated Learning (FL) has emerged as one of the prominent paradigms for distributed machine learning (ML). However, it is well-established that its performance can degrade significantly under non-IID (non-independent and identically distributed) data distributions across clients. To study this effect, existing works predominantly emulate data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the data heterogeneity in computer vision tasks beyond classification, exposing an overlooked gap in the literature. Motivated by this, we use pre-trained deep neural networks to extract task-specific data embeddings, define task-specific data heterogeneity through the lens of each vision task, and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. For instance, across seven representative computer vision tasks, our embedding-based heterogeneity formulation leads to an increase of up to around 60% in the observed loss under FedAvg, indicating that it more accurately exposes the performance degradation caused by data heterogeneity. We further unveil a series of open research directions that can be pursued.
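
The partitioning idea in the abstract (cluster the embeddings, then split the clusters across clients with a Dirichlet distribution) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the authors' implementation: the embeddings are synthetic 2-D points standing in for penultimate-layer features, the clustering is a plain Lloyd's K-means, the helper names (`dirichlet`, `kmeans`, `partition`) are hypothetical, and each client's cluster proportions are assumed to be drawn from Dir(α·p) with a uniform prior p.

```python
import random


def dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in concentration]
        total = sum(draws)
        if total > 0:  # tiny concentrations can underflow to all zeros; redraw
            return [d / total for d in draws]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means; returns one cluster id per point."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign(points, centers)


def partition(cluster_ids, n_clients, alpha, rng):
    """Give each client an equal quota of points, drawn by Dirichlet cluster mix."""
    pools = {}
    for idx, c in enumerate(cluster_ids):
        pools.setdefault(c, []).append(idx)
    for pool in pools.values():
        rng.shuffle(pool)
    clusters = sorted(pools)
    prior = [1.0 / len(clusters)] * len(clusters)  # assumed uniform prior
    quota = len(cluster_ids) // n_clients
    clients = []
    for _ in range(n_clients):
        mix = dirichlet([alpha * p for p in prior], rng)
        picked = []
        for _ in range(quota):
            # Only clusters that still have unassigned points are eligible.
            weights = [mix[j] if pools[c] else 0.0 for j, c in enumerate(clusters)]
            if sum(weights) == 0:  # preferred clusters exhausted: use leftovers
                weights = [float(len(pools[c])) for c in clusters]
            chosen = rng.choices(clusters, weights=weights)[0]
            picked.append(pools[chosen].pop())
        clients.append(picked)
    return clients


rng = random.Random(0)
# Synthetic stand-in for penultimate-layer embeddings: three 2-D Gaussian blobs.
blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
embeddings = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
              for cx, cy in blob_centers for _ in range(100)]
cluster_ids = kmeans(embeddings, k=3, rng=rng)
clients = partition(cluster_ids, n_clients=10, alpha=0.1, rng=rng)
for i, idxs in enumerate(clients[:3]):
    counts = [sum(cluster_ids[j] == c for j in idxs) for c in range(3)]
    print(f"client {i}: points per cluster = {counts}")
```

In a real setting the `embeddings` list would be replaced by features extracted from a network trained on the task of interest. The cluster ids then play the role that class labels play in conventional label-skew partitioning.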

Paper Structure

This paper contains 34 sections, 3 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Dirichlet-generated data distribution for $N = 25$ clients and $16$ classes with homogeneous prior probabilities (i.e., $p_i = 1/|\mathcal{C}|,~ \forall 1 \leq i \leq |\mathcal{C}|$). In the case of $\alpha=10^6$, the data is homogeneously distributed across clients, whereas in the case of $\alpha=0.1$, the majority of each client's data comprises two labels at most. The decrease in the homogeneity of the clients' data is also observed by comparing the data distributions generated by $\alpha=10^3$ and $\alpha=10$.
  • Figure 2: Visualization of the tasks from the Taskonomy dataset that have been used in our experiments. Unless otherwise stated, the loss functions used for the tasks are as follows: $\ell_1$ loss for Euclidean Depth Estimation, 2D Edges, Surface Normals, and 3D Keypoints; mean squared error for Reshading; and cross-entropy for Scene Classification and Semantic Segmentation.
  • Figure 3: Class-based vs. embedding-based distribution: a comparison of how applying the Dirichlet distribution over the datapoints' labels (equivalently, the scene class feature) versus the extracted embeddings affects the performance of FL. As shown, distributing the datapoints based on the class/label of the scene they correspond to (the left plot in each box) does not result in a significant change in performance across the values of $\alpha$. However, distributing the datapoints based on the clusters formed by the extracted embeddings (the right plot in each box) creates a performance gap across the values of $\alpha$, with $\alpha=0.1$ (i.e., the most heterogeneous case) resulting in the worst performance and $\alpha=1000$ (i.e., the least heterogeneous case) resulting in the best performance. The results further show that class-based data heterogeneity overestimates FL performance: it yields seemingly small loss values because it does not perturb the task-relevant feature space. In contrast, embedding-based heterogeneity increases the loss, revealing the true sensitivity of FL methods to realistic data variations in computer vision tasks. Only the class-based experiments on the Scene Classification task (i.e., the bottom-most box) exhibit a notable performance change with varying $\alpha$, mirroring the trend seen in the embedding-based experiments. This indicates that the most salient feature for the Scene Classification task (i.e., the semantic class label) is naturally captured in the embeddings as well, leading to consistent changes in loss across both class-based and embedding-based settings.
  • Figure 4: In our approach, data heterogeneity is induced from the perspective of each task. First (Step 1), data points are fed to a pre-trained neural network trained on the task (this can be any off-the-shelf model, as long as it is trained for the task of interest). Next (Step 2), the embeddings of the data points are extracted from the penultimate layer of the model. Then (Step 3), the embeddings are clustered using a clustering algorithm such as K-means, which unveils the similarity/dissimilarity of the data points from the task perspective. Finally (Step 4), a Dirichlet distribution is applied to the clustered data to emulate the local datasets of clients, treating the datapoints in the same cluster as belonging to the same group (analogous to labels in classification tasks).
  • Figure 5: The figure illustrates embedding clusters extracted from a Scene Classification task model, where each color represents a specific class. It is evident that the embeddings create a separable space corresponding to the scene classification labels. This demonstrates that distributing data points across FL clients using the clusters/groups depicted in the figure is equivalent to distributing data based on the original scene class labels. The observed alignment between the embedding-based clustering and the class labels underscores the effectiveness of using embeddings as a method to induce task-specific data heterogeneity in FL.
  • ...and 6 more figures
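
The effect of $\alpha$ described in the Figure 1 caption can be reproduced numerically with a short sketch. It assumes each client's label proportions are drawn from Dir(α·p) with the uniform prior $p_i = 1/|\mathcal{C}|$, a common parametrization that the caption suggests but does not spell out. The helper names are hypothetical, and the summary statistic (the mean share of each client's two most frequent labels) is chosen here only to mirror the caption's "two labels at most" observation.

```python
import random


def dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in concentration]
        total = sum(draws)
        if total > 0:  # tiny concentrations can underflow to all zeros; redraw
            return [d / total for d in draws]


def client_label_proportions(n_clients, n_classes, alpha, rng):
    """One label-proportion vector per client, drawn from Dir(alpha * p)."""
    prior = [1.0 / n_classes] * n_classes  # homogeneous prior, as in the caption
    return [dirichlet([alpha * p for p in prior], rng) for _ in range(n_clients)]


def mean_top2_share(proportions):
    """Average fraction of a client's data covered by its two largest labels."""
    top2 = [sum(sorted(q, reverse=True)[:2]) for q in proportions]
    return sum(top2) / len(top2)


rng = random.Random(0)
for alpha in (1e6, 1e3, 10, 0.1):
    props = client_label_proportions(n_clients=25, n_classes=16, alpha=alpha, rng=rng)
    print(f"alpha={alpha:g}: mean share of top-2 labels = {mean_top2_share(props):.2f}")
```

With 16 equally likely classes, a perfectly homogeneous client has a top-2 share of 2/16 = 0.125, so values close to that indicate near-IID clients, while values approaching 1 indicate clients dominated by one or two labels.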

Theorems & Definitions (4)

  • Remark 1: Implications on Designing Effective Countermeasures to Address Data Heterogeneity in FL for Vision Tasks
  • Remark 2: Implications on the Concept of Task Similarity
  • Remark 3: Implications on Multi-Task FL and its Extensions
  • Remark 4: Implications on the Complexity of Data Partitioning in FL