HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Feng Niu, Benjamin Recht, Christopher Re, Stephen J. Wright

TL;DR

The paper introduces Hogwild!, a lock-free parallel SGD method that exploits sparsity to allow asynchronous updates in shared memory. It provides a theoretical framework showing near-linear speedups under mild sparsity conditions and bounded staleness, along with robust 1/k convergence via a backoff scheme. Empirically, Hogwild! outperforms locking-based approaches across sparse SVM, matrix completion, and graph-cut problems. The work highlights practical gains for multicore training and lays groundwork for contention-reducing extensions.

Abstract

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
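
The update scheme described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synthetic sparse least-squares problem, the function names (`make_sparse_problem`, `worker`, `hogwild`, `mean_squared_error`), the step size `gamma`, and the thread count are all hypothetical choices made for the example. Several threads share one parameter vector, read it without locking, and write back only the few coordinates their sampled example touches. In standard CPython the global interpreter lock prevents real parallel speedup, so the sketch shows the lock-free access pattern rather than the performance gains reported in the paper.

```python
import threading
import numpy as np


def make_sparse_problem(n_examples=2000, dim=200, nnz=3, seed=0):
    """Synthetic data (an assumption): each example touches only `nnz` coordinates."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=dim)
    idx = np.stack([rng.choice(dim, size=nnz, replace=False)
                    for _ in range(n_examples)])
    vals = rng.normal(size=(n_examples, nnz))
    y = np.einsum("ij,ij->i", vals, x_true[idx])  # noise-free targets
    return idx, vals, y, x_true


def worker(x, idx, vals, y, steps, gamma, seed):
    """One thread: sample an example, update its coordinates in shared `x`, no locks."""
    rng = np.random.default_rng(seed)  # per-thread generator
    n = len(y)
    for _ in range(steps):
        i = rng.integers(n)
        cols = idx[i]
        # Read the shared coordinates; other threads may be writing them right now.
        residual = vals[i] @ x[cols] - y[i]
        # Write back only the touched coordinates, one at a time, without a lock.
        for c, v in zip(cols, vals[i]):
            x[c] -= gamma * residual * v


def hogwild(idx, vals, y, dim, n_threads=4, steps=10000, gamma=0.05):
    """Run `n_threads` lock-free workers against a single shared vector."""
    x = np.zeros(dim)
    threads = [threading.Thread(target=worker,
                                args=(x, idx, vals, y, steps, gamma, t))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x


def mean_squared_error(x, idx, vals, y):
    pred = np.einsum("ij,ij->i", vals, x[idx])
    return float(np.mean((pred - y) ** 2))


if __name__ == "__main__":
    idx, vals, y, _ = make_sparse_problem()
    x = hogwild(idx, vals, y, dim=200)
    print("initial MSE:", mean_squared_error(np.zeros(200), idx, vals, y))
    print("final MSE:  ", mean_squared_error(x, idx, vals, y))
```

Because each example touches 3 of 200 coordinates, two threads rarely write the same entry at the same moment, which is the intuition behind the sparsity condition the abstract refers to.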

Paper Structure

This paper contains 16 sections, 1 theorem, 65 equations, 5 figures, 1 algorithm.

Key Result

Proposition 4.1

Suppose in Algorithm 1 that the lag between when a gradient is computed and when it is used in step $j$, namely $j-k(j)$, is always less than or equal to $\tau$, and $\gamma$ is defined to be for some $\epsilon>0$ and $\vartheta\in(0,1)$. Define $D_0:=\|x_0 - x_\star\|^2$ and let $k$ be an integer satisfying Then after $k$ component updates of $x$, we have $\mathbb{E}[ f(x_k

Figures (5)

  • Figure 1: Example graphs induced by cost function. (a) A sparse SVM induces a hypergraph where each hyperedge corresponds to one example. (b) A matrix completion example induces a bipartite graph between the rows and columns with an edge between two nodes if an entry is revealed. (c) The induced hypergraph in a graph-cut problem is simply the graph whose cuts we aim to find.
  • Figure 2: Comparison of wall clock time for Hogwild! and round-robin (RR). Each algorithm is run for $20$ epochs and parallelized over 10 cores.
  • Figure 3: Total CPU time versus number of threads for (a) RCV1, (b) Abdomen, and (c) DBLife.
  • Figure 4: Total CPU time versus number of threads for the matrix completion problems (a) Netflix Prize, (b) KDD Cup 2011, and (c) the synthetic Jumbo experiment.
  • Figure 5: (a) Speedup for the three matrix completion problems with Hogwild!. In all three cases, massive speedup is achieved via parallelism. (b) The training error at the end of each epoch of SVM training on RCV1 for the averaging algorithm of Zinkevich et al. (2010). (c) Speedup achieved over serial method for various levels of delays (measured in nanoseconds).
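
The induced graphs described in the Figure 1 caption can be made concrete with a small sketch. It treats each sparse example, or each revealed matrix entry, as the set of variables it touches, then computes three simple sparsity statistics. The function names and the toy inputs are hypothetical, and the statistics are illustrative measures of how much the edges overlap; they are not presented here as the paper's exact definitions.

```python
from collections import defaultdict


def hyperedges_from_examples(examples):
    """Sparse SVM view: one hyperedge per example, holding the feature indices it uses."""
    return [frozenset(ex.keys()) for ex in examples]


def hyperedges_from_ratings(entries, n_rows):
    """Matrix completion view: revealed entry (i, j) links row node i to column node j.

    Column nodes are offset by n_rows so row and column ids never collide.
    """
    return [frozenset((i, n_rows + j)) for i, j in entries]


def max_edge_size(edges):
    """Largest number of variables touched by a single edge."""
    return max(len(e) for e in edges)


def max_variable_density(edges):
    """Largest fraction of edges that contain any one variable."""
    counts = defaultdict(int)
    for e in edges:
        for v in e:
            counts[v] += 1
    return max(counts.values()) / len(edges)


def max_edge_overlap(edges):
    """Largest fraction of edges (itself included) sharing a variable with one edge."""
    by_var = defaultdict(set)
    for k, e in enumerate(edges):
        for v in e:
            by_var[v].add(k)
    worst = 0
    for e in edges:
        touching = set().union(*(by_var[v] for v in e))
        worst = max(worst, len(touching))
    return worst / len(edges)


if __name__ == "__main__":
    # Toy sparse examples: feature index -> value.
    examples = [{0: 1.0, 1: 2.0}, {1: 0.5, 2: 1.0, 3: -1.0}, {4: 1.0}, {5: 2.0, 0: 1.0}]
    edges = hyperedges_from_examples(examples)
    print(max_edge_size(edges), max_variable_density(edges), max_edge_overlap(edges))
```

Small values of all three statistics mean that concurrent updates seldom touch the same variables, which is the regime where lock-free updates are expected to conflict rarely.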

Theorems & Definitions (1)

  • Proposition 4.1