Cross-Silo Federated Learning for Multi-Tier Networks with Vertical and Horizontal Data Partitioning

Anirban Das, Timothy Castiglia, Shiqiang Wang, Stacy Patterson

TL;DR

This work addresses learning over data that are vertically partitioned across silos and horizontally partitioned within silos in a cross-silo federated setting. It introduces Tiered Decentralized Coordinate Descent (TDCD), which interleaves coordinate descent across silo hubs with local SGD inside silos. TDCD reduces communication by performing $Q$ local updates between communication rounds and by exchanging embeddings, rather than raw features, among silos. The authors provide a convergence analysis with a bound of rate $O(1/\sqrt{R})$ under standard assumptions, and discuss how the bound scales with the number of silos $N$ and the local-update parameter $Q$; special cases recover established results for local SGD and vertical FL. Empirical evaluations on CIFAR-10, MIMIC-III, and ModelNet40 show that TDCD remains stable as partitioning increases and that its communication savings are largest in latency-dominated networks. The results offer practical guidance for setting the local-update count $Q$ in different latency regimes and highlight TDCD's potential for scalable, privacy-preserving learning in multi-tier organizations.

Abstract

We consider federated learning in tiered communication networks. Our network model consists of a set of silos, each holding a vertical partition of the data. Each silo contains a hub and a set of clients, with the silo's vertical data shard partitioned horizontally across its clients. We propose Tiered Decentralized Coordinate Descent (TDCD), a communication-efficient decentralized training algorithm for such two-tiered networks. The clients in each silo perform multiple local gradient steps before sharing updates with their hub to reduce communication overhead. Each hub adjusts its coordinates by averaging its workers' updates, and then hubs exchange intermediate updates with one another. We present a theoretical analysis of our algorithm and show the dependence of the convergence rate on the number of vertical partitions and the number of local updates. We further validate our approach empirically via simulation-based experiments using a variety of datasets and objectives.
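The round structure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a linear model with squared loss, full-batch local gradient steps instead of SGD, and the same row split in every silo. All names (`tdcd_sketch`, `make_toy_data`, `n_silos`, `n_clients`) are hypothetical.

```python
import numpy as np


def make_toy_data(n_samples=200, n_features=8, seed=0):
    """Synthetic regression data (an illustrative assumption, not from the paper)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    w_true = rng.normal(size=n_features)
    y = X @ w_true + 0.1 * rng.normal(size=n_samples)
    return X, y


def tdcd_sketch(X, y, n_silos=2, n_clients=4, Q=5, rounds=50, lr=0.05, seed=0):
    """Simplified two-tier training loop in the spirit of TDCD."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Vertical partition: each silo owns a block of feature columns.
    feature_blocks = np.array_split(np.arange(n_features), n_silos)
    # Horizontal partition: each client in a silo owns a block of sample rows.
    row_blocks = np.array_split(rng.permutation(n_samples), n_clients)
    thetas = [np.zeros(len(cols)) for cols in feature_blocks]
    losses = []
    for _ in range(rounds):
        # Each silo computes its embedding; hubs exchange embeddings once per round.
        embeddings = [X[:, cols] @ th for cols, th in zip(feature_blocks, thetas)]
        total = np.sum(embeddings, axis=0)
        losses.append(0.5 * np.mean((total - y) ** 2))
        new_thetas = []
        for j, cols in enumerate(feature_blocks):
            # Other silos' embeddings stay fixed (stale) during the Q local steps.
            stale_others = total - embeddings[j]
            client_models = []
            for rows in row_blocks:
                X_jk = X[np.ix_(rows, cols)]
                local = thetas[j].copy()
                for _ in range(Q):
                    residual = X_jk @ local + stale_others[rows] - y[rows]
                    local -= lr * X_jk.T @ residual / len(rows)
                client_models.append(local)
            # The hub averages its clients' models to update its coordinate block.
            new_thetas.append(np.mean(client_models, axis=0))
        thetas = new_thetas
    return thetas, losses
```

Calling `tdcd_sketch(*make_toy_data())` returns one parameter block per silo and the loss recorded at the start of each round. Raising `Q` cuts the number of embedding exchanges per local step, at the cost of clients working longer with stale embeddings from the other silos.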

Paper Structure

This paper contains 19 sections, 3 theorems, 20 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Under the paper's Lipschitz, lower-bound, and unbiasedness assumptions (labels assumption.lipschitz, assumption.lowerbound, assumption.unbiased), when the learning rate $\eta$ satisfies the condition stated in the theorem, the expected averaged squared gradient of ${\mathcal{L}}$ over $T=QR>0$ local iterations satisfies the bound stated there.

Figures (9)

  • Figure 1: (a) System architecture. (b) An example of client models and the inputs and outputs to the system. The figure shows a client $k$ at each of the two silos and the concatenation of their output embeddings to predict the target $\tilde{y}$ corresponding to a single data sample $p$.
  • Figure 2: Illustration of data partitioning among hubs and clients in silos. Each silo $j$ owns a vertical partition of the full dataset $\mathbf{X}$. The features of each sample $i$ are distributed among clients of different silos. Each client in each silo holds a subset of the features for a subset of the sample IDs.
  • Figure 3: Performance of TDCD on CIFAR-10. We show the test accuracy score vs. training time units $t_Q$. $N=2$ in all four figures. $K=50$ in Figures (a) and (b), and $K=100$ in Figures (c) and (d). In Figures (a) and (c), we use a lower ratio of communication vs. computation latency with $t_{comm}=10$. In Figures (b) and (d), we use a higher ratio of communication vs. computation latency with $t_{comm}=100$. $t_{comp}=1$ in all figures.
  • Figure 4: Performance of TDCD on MIMIC-III. We show the test F1 score vs. training time units $t_Q$. $N=4$ in all four figures. $K=20$ in Figures (a) and (b) and $K=50$ in Figures (c) and (d). In Figures (a) and (c), we use a lower ratio of communication vs. computation latency with $t_{comm}=10$. In Figures (b) and (d), we use a higher ratio of communication vs. computation latency with $t_{comm}=100$. $t_{comp}=1$ in all figures.
  • Figure 5: Performance of TDCD on ModelNet40. We show the top-$5$ test accuracy score vs. training time units $t_Q$. $N=12$ in all four figures. $K=10$ in Figures (a) and (b), and $K=20$ in Figures (c) and (d). In Figures (a) and (c), we use a lower ratio of communication vs. computation latency with $t_{comm}=10$. In Figures (b) and (d), we use a higher ratio of communication vs. computation latency with $t_{comm}=100$. $t_{comp}=1$ in all figures.
  • ...and 4 more figures
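The figure captions above vary the ratio of communication latency $t_{comm}$ to computation latency $t_{comp}$. A rough cost model shows why more local updates help when latency dominates. It assumes, as a simplification not taken from the paper, that each round costs $Q$ local steps of $t_{comp}$ each plus one communication phase of $t_{comm}$. The function names are hypothetical.

```python
def wall_clock_time(rounds, Q, t_comp=1.0, t_comm=10.0):
    """Total time for `rounds` rounds, each with Q local steps and one exchange.

    Assumed cost model: rounds * (Q * t_comp + t_comm).
    """
    return rounds * (Q * t_comp + t_comm)


def time_per_local_step(Q, t_comp=1.0, t_comm=10.0):
    """Average time per local step once communication is amortized over Q steps."""
    return t_comp + t_comm / Q
```

Under this model with $t_{comp}=1$ and $t_{comm}=100$, moving from $Q=1$ to $Q=10$ lowers the average cost per local step from 101 to 11 time units. With $t_{comm}=10$ the same change lowers it only from 11 to 2, so the benefit of a larger $Q$ is greatest in the high-latency setting.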

Theorems & Definitions (8)

  • Remark 1
  • Remark 2
  • Theorem 1
  • Remark 3
  • Remark 4
  • Remark 5
  • Lemma 1
  • Lemma 2