DiPaCo: Distributed Path Composition

Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam

TL;DR

This work proposes a co-designed modular architecture and training approach for ML models, dubbed DIstributed PAth COmposition (DiPaCo), which exceeds the performance of a 1 billion-parameter dense transformer language model by choosing one of 256 possible paths, each with a size of 150 million parameters.

Abstract

Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high-bandwidth communication between devices working in parallel. In this work, we propose a co-designed modular architecture and training approach for ML models, dubbed DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes computation by paths through a set of shared modules. Together with a Local-SGD-inspired optimization (DiLoCo) that keeps modules in sync with drastically reduced communication, our approach facilitates training across poorly connected and heterogeneous workers, with a design that ensures robustness to worker failures and preemptions. At inference time, only a single path needs to be executed for each input, without the need for any model compression. We consider this approach a first prototype towards a new paradigm of large-scale learning, one that is less synchronous and more modular. Our experiments on the widely used C4 benchmark show that, for the same number of training steps but less wall-clock time, DiPaCo exceeds the performance of a 1 billion-parameter dense transformer language model by choosing one of 256 possible paths, each with a size of 150 million parameters.
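
The idea of keeping a shared module in sync with infrequent communication can be illustrated with a small toy sketch. It is not the DiLoCo algorithm itself: the abstract does not give the update rule, so the sketch assumes plain Local-SGD-style parameter averaging. The function names `local_steps` and `sync_shared_module`, the scalar quadratic loss, and all numeric values are hypothetical and chosen only for illustration.

```python
# Toy sketch: two workers each train their own copy of a shared module
# (here a single scalar parameter) for many local steps, then average once.
# Assumption: plain averaging stands in for the actual outer update,
# which the abstract does not specify.


def local_steps(w, target, lr=0.1, steps=50):
    """Run `steps` gradient-descent steps on the toy loss 0.5 * (w - target)**2."""
    for _ in range(steps):
        grad = w - target
        w = w - lr * grad
    return w


def sync_shared_module(copies):
    """Average the workers' copies of the shared parameter (one communication round)."""
    return sum(copies) / len(copies)


shared = 0.0
targets = [1.0, 3.0]  # hypothetical: each worker's shard pulls toward a different optimum

# 3 communication rounds in total, each after 50 local steps per worker,
# instead of synchronizing after every one of the 150 steps.
for round_idx in range(3):
    copies = [local_steps(shared, t) for t in targets]
    shared = sync_shared_module(copies)
```

In this toy setting the shared parameter settles near the average of the two workers' optima, even though the workers exchange information only three times.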

Paper Structure (46 sections, 6 equations, 11 figures, 5 tables)


Figures (11)

  • Figure 1: Long-term Goal: Ultimately, we envision a modular network where different components, paths $\pi_i$, are optimized for different tasks, $\mathcal{D}_j$, each designed by different researchers. The paths, trained on any available hardware type, communicate infrequently across the world, exchanging useful information and enabling new forms of composition.
  • Figure 2: An illustration of the first example from Section \ref{sec:notation}. A $4$-layer neural network, with block $B_1$ consisting of the first $2$ layers and $B_2$ consisting of the next $2$ layers. Each block has $3$ choices of module (each with its own parameters), represented by different colors. On the left, we show all of the $9$ possible paths. On the right, we show a single path.
  • Figure 3: Routing More Frequently at Test-Time: At training time (left panel), the router selects the path $\pi_i$ using the prefix $z$. We train the chosen path on the whole sequence using the usual language modeling loss. At test time (right panel), the path selected by the router given the prefix is used to score the next chunk of tokens. Then, we re-use the router to choose the most likely path given the new chunk. This process repeats until the whole sequence has been scored.
  • Figure 4: DiPaCo: (left) The dataset is pre-sharded into $k$ shards, $\mathcal{D}_i$ (here $k=4$). (middle) Compact view of a $2\times2$ DiPaCo, which is never instantiated. In this toy illustration, there are three levels. Levels 2 and 3 each have a mixture of two modules. Level 1 has a single module shared by all paths. (right) We associate each shard $\mathcal{D}_i$ with a path $\pi_i$, $\forall i \in [1, 4]$. In this toy illustration, a path is the composition of three neural network blocks. The color refers to the id of a module. The figure shows the modular network unrolled across the four paths. These are trained using DiLoCo, which requires communication only every few hundred steps. In this example, module 2a (in red) is shared by paths $\pi_1$ and $\pi_2$. Workers associated with paths might use different hardware types (different kinds of GPUs or TPUs) and might be located in geographically distant areas.
  • Figure 5: DiPaCo with more capacity: In this example, level 3 modules are path specific, i.e., modules at that level are not shared by paths.
  • ...and 6 more figures
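
The path counts in the captions of Figures 2 and 4 can be reproduced with a short sketch. It treats a path as one module choice per block (or level) and enumerates all combinations. The helper names `enumerate_paths` and `paths_sharing`, the module id scheme (`1a`, `2a`, `2b`, ...), and the ordering of paths are assumptions made for illustration; the ordering is chosen so that it agrees with the Figure 4 caption, where module 2a is shared by $\pi_1$ and $\pi_2$.

```python
from itertools import product


def enumerate_paths(choices_per_block):
    """Return every path as a tuple of module ids, one id per block.

    `choices_per_block[i]` is the number of module choices at block i + 1.
    Module ids such as "2a" are a hypothetical naming scheme.
    """
    blocks = [
        [f"{level}{chr(ord('a') + m)}" for m in range(n)]
        for level, n in enumerate(choices_per_block, start=1)
    ]
    return list(product(*blocks))


def paths_sharing(paths, module_id):
    """Return the 1-based indices of the paths that pass through `module_id`."""
    return [i for i, path in enumerate(paths, start=1) if module_id in path]


# Figure 2: two blocks with three module choices each.
fig2_paths = enumerate_paths([3, 3])

# Figure 4: level 1 has one shared module; levels 2 and 3 have two modules each.
fig4_paths = enumerate_paths([1, 2, 2])

print(len(fig2_paths))                  # number of paths in the Figure 2 setup
print(len(fig4_paths))                  # number of paths in the Figure 4 setup
print(paths_sharing(fig4_paths, "2a"))  # paths that share module 2a
```

Because the number of paths is the product of the per-block choices, a small set of shared modules yields many distinct paths, while each individual path stays the size of one module per block.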