A Multi-Token Coordinate Descent Method for Semi-Decentralized Vertical Federated Learning
Pedro Valdeira, Yuejie Chi, Cláudia Soares, João Xavier
TL;DR
This work tackles semi-decentralized vertical federated learning by introducing MTCD, a multi-token coordinate descent method that unifies client-server and decentralized schemes and enables a spectrum between them. The method extends STCD to multiple tokens that roam and periodically sync at a server, balancing parallel computation and information coupling to improve convergence and communication efficiency. Theoretical results establish $\mathcal{O}(1/T)$ convergence for nonconvex objectives (and related rates under convexity) under large batch sizes, with analysis covering token overlap, clustering, and mini-batch variance. Empirically, MTCD outperforms fully decentralized and standard client-server baselines across convex problems and neural-network tasks, particularly when communication costs differ between client-client and client-server links, highlighting its practical impact for scalable VFL deployments.
Abstract
Most federated learning (FL) methods use a client-server scheme, where clients communicate only with a central server. However, this scheme is prone to bandwidth bottlenecks at the server and has a single point of failure. In contrast, in a (fully) decentralized approach, clients communicate directly with each other, dispensing with the server and mitigating these issues. Yet, as the client network grows larger and sparser, the convergence of decentralized methods slows down, even failing to converge if the network is disconnected. This work addresses this gap between client-server and decentralized schemes, focusing on the vertical FL setup, where clients hold different features of the same samples. We propose multi-token coordinate descent (MTCD), a flexible semi-decentralized method for vertical FL that can exploit both client-server and client-client links. By selecting appropriate hyperparameters, MTCD recovers the client-server and decentralized schemes as special cases. In fact, its decentralized instance is itself a novel method of independent interest. Yet, by controlling the degree of dependency on client-server links, MTCD can also explore a spectrum of schemes ranging from client-server to decentralized. We prove that, for sufficiently large batch sizes, MTCD converges at an $\mathcal{O}(1/T)$ rate for nonconvex objectives when the tokens roam across disjoint subsets of clients. To capture the aforementioned drawbacks of the client-server scheme succinctly, we model the relative impact of using client-server versus client-client links as the ratio of their "costs", which depends on the application. This allows us to demonstrate, both analytically and empirically, that by tuning the degree of dependency on the server, the semi-decentralized instances of MTCD can outperform both client-server and decentralized approaches across a range of applications.
