Non-Coherent Over-the-Air Decentralized Gradient Descent

Nicolo' Michelusi

TL;DR

A scalable DGD algorithm that eliminates the need for scheduling, topology information, or CSI (both average and instantaneous), and introduces a consensus stepsize that mitigates consensus estimation errors due to energy fluctuations around their expected values.

Abstract

Implementing Decentralized Gradient Descent (DGD) in wireless systems is challenging due to noise, fading, and limited bandwidth, necessitating topology awareness, transmission scheduling, and the acquisition of channel state information (CSI) to mitigate interference and maintain reliable communications. These operations may result in substantial signaling overhead and scalability challenges in large networks lacking central coordination. This paper introduces a scalable DGD algorithm that eliminates the need for scheduling, topology information, or CSI (both average and instantaneous). At its core is a Non-Coherent Over-The-Air (NCOTA) consensus scheme that exploits a noisy energy superposition property of wireless channels. Nodes encode their local optimization signals into energy levels within an OFDM frame and transmit simultaneously, without coordination. The key insight is that the received energy equals, on average, the sum of the energies of the transmitted signals, scaled by their respective average channel gains, akin to a consensus step. This property enables unbiased consensus estimation, utilizing average channel gains as mixing weights, thereby removing the need for their explicit design or for CSI. Introducing a consensus stepsize mitigates consensus estimation errors due to energy fluctuations around their expected values. For strongly-convex problems, it is shown that the expected squared distance between the local and globally optimal models vanishes at a rate of $\mathcal{O}(1/\sqrt{k})$ after $k$ iterations, with suitable decreasing learning and consensus stepsizes. Extensions accommodate a broad class of fading models and frequency-selective channels. Numerical experiments on image classification demonstrate faster convergence in terms of running time compared to state-of-the-art schemes, especially in dense network scenarios.
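
The role of the consensus stepsize can be illustrated with a minimal sketch on a toy scalar problem. This is not the paper's algorithm: the quadratic local losses, the additive Gaussian noise standing in for energy fluctuations, the uniform `gains` matrix, both stepsize schedules, and the name `noisy_consensus_dgd` are assumptions made purely for illustration.

```python
import random


def noisy_consensus_dgd(targets, gains, iters=5000, noise_std=0.5, seed=0):
    """Toy DGD on f_i(w) = 0.5 * (w - targets[i])**2 with a noisy consensus term.

    gains[i][j] plays the role of the mixing weight between nodes i and j
    (assumed symmetric here); the noise models consensus estimation error.
    """
    rng = random.Random(seed)
    n = len(targets)
    w = [0.0] * n
    for k in range(iters):
        eta = 1.0 / (k + 1)             # learning stepsize (assumed schedule)
        gamma = 1.0 / (k + 1) ** 0.75   # consensus stepsize (assumed schedule)
        new_w = []
        for i in range(n):
            # Unbiased but noisy estimate of the disagreement with neighbors.
            d = sum(gains[i][j] * (w[j] - w[i]) for j in range(n) if j != i)
            d_noisy = d + rng.gauss(0.0, noise_std)
            grad = w[i] - targets[i]    # gradient of the local quadratic loss
            new_w.append(w[i] + gamma * d_noisy - eta * grad)
        w = new_w
    return w


targets = [2.0, 2.5, 3.5, 4.0]
gains = [[0.0 if i == j else 0.2 for j in range(4)] for i in range(4)]
w_final = noisy_consensus_dgd(targets, gains)
w_star = sum(targets) / len(targets)    # minimizer of the sum of local losses
mse = sum((wi - w_star) ** 2 for wi in w_final) / len(w_final)
print(f"mean squared distance to optimum: {mse:.4f}")
```

In this sketch the decaying consensus stepsize progressively damps the noise injected at each iteration, while the learning stepsize decays faster so that the nodes stay close to agreement as they approach the common minimizer.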

Paper Structure

This paper contains 27 sections, 7 theorems, 129 equations, 7 figures, 1 algorithm.

Key Result

Lemma 1

The energy received on the $m$th subcarrier of node $i$, $|[\mathbf y_{i}]_m|^2$, satisfies the energy superposition property, with $\mathbb E$ computed with respect to the noise, Rayleigh fading, and random transmission decisions of nodes $j\neq i$. In this section, we implicitly assume that all expectations are conditional on $\{\mathbf w_j,\forall j\}$, hence on $\{\mathbf p_j,\forall j\}$.
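
A minimal Monte Carlo sketch of the energy superposition idea, under simplifying assumptions that are not taken from the paper: a single resource unit, independent Rayleigh fading per transmitter, additive complex Gaussian noise, and every transmitter active. The function names and numeric values are hypothetical. Under this model the average received energy equals the sum of transmitted energies weighted by the average channel gains, plus the noise power.

```python
import random


def received_energy(tx_energies, avg_gains, noise_power, rng):
    """One realization of the energy received on a single resource unit."""
    re, im = 0.0, 0.0
    for energy, gain in zip(tx_energies, avg_gains):
        # Rayleigh fading: complex Gaussian channel with E[|h|^2] = gain.
        sigma = (gain / 2.0) ** 0.5
        h_re, h_im = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        amplitude = energy ** 0.5
        re += h_re * amplitude
        im += h_im * amplitude
    # Additive complex Gaussian noise with total power noise_power.
    sigma_n = (noise_power / 2.0) ** 0.5
    re += rng.gauss(0.0, sigma_n)
    im += rng.gauss(0.0, sigma_n)
    return re * re + im * im


def average_received_energy(tx_energies, avg_gains, noise_power,
                            trials=50_000, seed=0):
    """Average the received energy over independent fading/noise draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += received_energy(tx_energies, avg_gains, noise_power, rng)
    return total / trials


tx_energies = [1.0, 0.5, 2.0]   # hypothetical transmit energies
avg_gains = [0.8, 0.3, 0.1]     # hypothetical average channel gains
noise_power = 0.05

expected = sum(g * e for g, e in zip(avg_gains, tx_energies)) + noise_power
empirical = average_received_energy(tx_energies, avg_gains, noise_power)
print(f"expected {expected:.3f}, empirical {empirical:.3f}")
```

Individual realizations fluctuate strongly around this mean, which is the kind of estimation error the consensus stepsize is meant to mitigate.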

Figures (7)

  • Figure 1: Example of $d{=}2$-dimensional problem with $N{=}4$ nodes. The circle represents the set $\mathcal{W}$, whereas the diamond represents the convex hull of $M{=}5$ codewords from the CP0 codebook (Example \ref{ex1}), $\mathbf z_1,\dots,\mathbf z_5$. In frame $k=4$, nodes A and D encode their local state $\mathbf w_i$ to a transmit signal $\mathbf x_i$ (Eq. \ref{xM1}), and transmit simultaneously; receiving nodes C and D estimate the disagreement signal $\tilde{\mathbf d}_i$ from their received signal $\mathbf y_i$ (Eqs. \ref{yi}-\ref{dik}).
  • Figure 2: Example of $M{=}5$-dimensional $\mathbf p$ mapped to a frame containing $O=3$ OFDM symbols, each with $\mathrm{SC}=8$ subcarriers, over 5 iterations. The color-coded sets $\mathcal{R}_{1},\dots,\mathcal{R}_5$ correspond to a partition of the $Q=3\times 8=24$ resource units (across subcarriers and OFDM symbols) allocated to a certain signal dimension. For instance, at iteration $1$, $m=1$ is allocated to $\mathcal{R}_1$, containing the resource units $\{1,6,11,16,21\}$. Frames #1 to #5 demonstrate the circular subcarrier shift (Sec. \ref{broadclass}). In this example, each signal dimension is mapped to all resource units across 5 iterations.
  • Figure 3: Example of spatially-dependent deployment scenario with $N=40$ nodes. Each node holds data from only one class, indicated by the label indices '0' to '9' (4 nodes per class). For instance, node $\star$ holds data from class '3'. With 3 reflectors, there are 4 signal paths, shown in the figure between a generic transmitter ($\star$) and receiver ($\blacktriangle$) pair.
  • Figure 4: Normalized error vs time, for different configurations of NCOTA-DGD and four different scenarios. The common legend is shown in the left figure, and shows the frame duration of each configuration. (a) Spatially-i.i.d. labels and i.i.d. channels; (b) Spatially-dependent labels and i.i.d. channels; (c) Spatially-i.i.d. labels and block-fading channels (2ms coherence time); (d) Spatially-i.i.d. labels and static channels.
  • Figure 5: Normalized error (a), suboptimality gap (b), test error (c) and number of iterations (d) vs number of nodes $N$, after 2000ms of execution time, under spatially-i.i.d. (solid lines) and -dependent (dashed lines with markers) label scenarios, with i.i.d. channels over frames. Common legend shown in figure (a).
  • ...and 2 more figures
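
The resource-unit partition described in the Figure 2 caption can be sketched as follows. The interleaved assignment reproduces the stated set $\mathcal{R}_1=\{1,6,11,16,21\}$, but the exact shift rule is an assumption (here, the assignment advances by one set per iteration), and the function name `resource_units` is hypothetical.

```python
def resource_units(dim, iteration, num_dims=5, num_units=24):
    """Resource units (1-indexed) carrying signal dimension `dim` at `iteration`.

    Assumptions: unit r belongs to set R_s when (r - 1) % num_dims == s - 1,
    and the dimension-to-set assignment advances by one set per iteration.
    """
    set_index = (dim - 1 + iteration - 1) % num_dims
    return [r for r in range(1, num_units + 1)
            if (r - 1) % num_dims == set_index]


first = resource_units(dim=1, iteration=1)
print(first)  # [1, 6, 11, 16, 21]

# Across 5 iterations, dimension 1 visits every one of the 24 resource units.
covered = sorted(u for k in range(1, 6) for u in resource_units(1, k))

# Within one iteration, the 5 dimensions partition the 24 resource units.
frame = sorted(u for m in range(1, 6) for u in resource_units(m, 1))
```

With 24 units and 5 dimensions, four of the sets hold 5 units and one holds 4, so cycling the assignment gives each dimension access to every resource unit over 5 iterations.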

Theorems & Definitions (20)

  • Definition 1: Laplacian matrix
  • Example 1: Cross-polytope-$\phi$ (CP$\phi$) codebook
  • Remark 1
  • Lemma 1: Energy superposition property
  • proof
  • Lemma 2
  • Definition 2: diameter of $\mathcal{W}$
  • Definition 3: gradient divergence
  • Theorem 3
  • proof
  • ...and 10 more