What happens when nanochat meets DiLoCo?

Alexander Acker, Soeren Becker, Sasho Nedelkoski, Dominik Scheinert, Odej Kao, Philipp Wiesner

TL;DR

This study investigates the viability of low-communication distributed training for LLMs by integrating the DiLoCo algorithm into the compact nanochat codebase and comparing it against a standard data-parallel (DDP) baseline. DiLoCo achieves stable pretraining and substantial communication savings, but it induces representational drift that degrades downstream alignment during mid-training and supervised fine-tuning (SFT). The results reveal a trade-off between training efficiency and downstream task fidelity: the drift persists even when mid-training and SFT are run with DDP on DiLoCo-pretrained weights. The work provides a reproducible end-to-end baseline and highlights the need for drift-aware synchronization strategies to make decentralized LLM training practical for downstream tasks.

Abstract

Although LLM training is typically centralized with high-bandwidth interconnects and large compute budgets, emerging methods target communication-constrained training in distributed environments. The model trade-offs introduced by this shift remain underexplored, and our goal is to study them. We use the open-source nanochat project, a compact 8K-line full-stack ChatGPT-like implementation containing tokenization, pretraining, fine-tuning, and serving, as a controlled baseline. We implement the DiLoCo algorithm as a lightweight wrapper over nanochat's training loop, performing multiple local steps per worker before synchronization with an outer optimizer, effectively reducing communication by orders of magnitude. This inner-outer training is compared against a standard data-parallel (DDP) setup. Because nanochat is small and inspectable, it enables controlled pipeline adaptations and allows direct comparison with the conventional centralized baseline. DiLoCo achieves stable convergence and competitive loss in pretraining but yields worse MMLU, GSM8K, and HumanEval scores after mid-training and SFT. We discover that using DiLoCo-pretrained weights and running mid- and post-training with DDP fails to recover performance, revealing irreversible representation drift from asynchronous updates that impairs downstream alignment. We provide this implementation as an official fork of nanochat on GitHub.
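
The inner-outer scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' wrapper around nanochat: it is a toy scalar problem in pure Python, with plain SGD standing in for the inner optimizer and Nesterov-momentum SGD as the outer optimizer (the choice made in the original DiLoCo description). The function names `noisy_grad`, `run_diloco`, and `ddp_sync_count`, as well as all hyperparameter values, are hypothetical and chosen only for illustration.

```python
import random


def noisy_grad(theta, target, noise_std, rng):
    """Gradient of 0.5 * (theta - target)**2 plus Gaussian noise.

    Stands in for a minibatch gradient on one worker's data shard.
    """
    return (theta - target) + noise_std * rng.gauss(0.0, 1.0)


def run_diloco(num_workers=4, outer_rounds=20, inner_steps=50,
               inner_lr=0.05, outer_lr=0.7, outer_momentum=0.9,
               noise_std=0.05, seed=0):
    """Toy inner-outer loop (assumed setup, not the paper's configuration)."""
    rng = random.Random(seed)
    # Each worker has a slightly different local optimum; their mean is 1.0.
    targets = [1.0 + 0.1 * (k - (num_workers - 1) / 2)
               for k in range(num_workers)]
    theta_global, velocity, sync_count = 0.0, 0.0, 0

    for _ in range(outer_rounds):
        local_thetas = []
        for k in range(num_workers):
            # Inner phase: start from the shared weights, no communication.
            theta = theta_global
            for _ in range(inner_steps):
                theta -= inner_lr * noisy_grad(theta, targets[k], noise_std, rng)
            local_thetas.append(theta)

        # Synchronization: the only communication in this round.
        # Pseudo-gradient = shared weights minus the averaged local weights.
        pseudo_grad = theta_global - sum(local_thetas) / num_workers

        # Outer step: SGD with Nesterov momentum on the pseudo-gradient.
        velocity = outer_momentum * velocity + pseudo_grad
        theta_global -= outer_lr * (pseudo_grad + outer_momentum * velocity)
        sync_count += 1

    return theta_global, sync_count


def ddp_sync_count(outer_rounds=20, inner_steps=50):
    """Per-step gradient synchronization over the same number of updates."""
    return outer_rounds * inner_steps


if __name__ == "__main__":
    theta, syncs = run_diloco()
    print(f"final theta: {theta:.4f}")
    print(f"DiLoCo syncs: {syncs}, DDP syncs: {ddp_sync_count()}")
```

In this sketch the number of synchronizations drops by a factor equal to `inner_steps`, which is where the communication savings come from. The toy objective says nothing about the downstream degradation reported in the paper; it only shows the control flow of local steps followed by an outer update.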

Paper Structure

This paper contains 18 sections, 2 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Training loss comparison during the base pretraining stage. DiLoCo (red) achieves stable convergence compared to the Standard (blue) baseline.
  • Figure 2: Mid-training loss comparison. DiLoCo (red) and Hybrid (green) models fail to minimize the dialogue-style objective relative to the Standard (blue) baseline.
  • Figure 3: SFT loss comparison. Both DiLoCo (red) and Hybrid (green) exhibit high loss floors, consistent with the quantitative collapse in Table 1.