FireCaffe: near-linear acceleration of deep neural network training on compute clusters

Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, Kurt Keutzer

TL;DR

FireCaffe addresses long DNN training times by distributing training across a GPU cluster and reducing inter-node communication wherever possible. It shows that data-parallel training with a reduction-tree allreduce, paired with high-bandwidth interconnects and careful tuning of batch size and hyperparameters, yields near-linear speedups: 39x for Network-in-Network (NiN) and 47x for GoogLeNet on 128 GPUs. The work also sets out best practices for fairly comparing scaling methods on ImageNet, and the shorter training times enable much faster exploration of architectures and potentially real-time training scenarios. Together, the choice of models with fewer parameters, efficient communication, and hardware awareness form a practical blueprint for scalable DNN training.

Abstract

Long training times for high-accuracy deep neural networks (DNNs) impede research into new DNN architectures and slow the development of high-accuracy DNNs. In this paper we present FireCaffe, which successfully scales deep neural network training across a cluster of GPUs. We also present a number of best practices to aid in comparing advancements in methods for scaling and accelerating the training of deep neural networks. The speed and scalability of distributed algorithms are almost always limited by the overhead of communicating between servers; DNN training is not an exception to this rule. Therefore, the key consideration here is to reduce communication overhead wherever possible, while not degrading the accuracy of the DNN models that we train. Our approach has three key pillars. First, we select network hardware that achieves high bandwidth between GPU servers; InfiniBand or Cray interconnects are ideal for this. Second, we consider a number of communication algorithms, and we find that reduction trees are more efficient and scalable than the traditional parameter server approach. Third, we optionally increase the batch size to reduce the total quantity of communication during DNN training, and we identify hyperparameters that allow us to reproduce the small-batch accuracy while training with large batch sizes. When training GoogLeNet and Network-in-Network on ImageNet, we achieve a 47x and 39x speedup, respectively, when training on a cluster of 128 GPUs.
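
To make the second pillar concrete, the following is a minimal single-process sketch of a reduction-tree allreduce over per-worker weight gradients. It is an illustration under stated assumptions, not FireCaffe's implementation: workers are simulated as plain Python lists, the tree is binary, and the names `tree_allreduce` and `add_vectors` are hypothetical. Gradients are summed pairwise going up the tree, and the final sum is then handed back to every worker.

```python
from typing import List, Tuple

Vector = List[float]


def add_vectors(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized gradient vectors."""
    return [x + y for x, y in zip(a, b)]


def tree_allreduce(grads: List[Vector]) -> Tuple[List[Vector], int]:
    """Simulate a binary reduction-tree allreduce (hypothetical helper).

    Returns one copy of the summed gradient per worker, plus the number
    of pairwise-summing rounds needed on the way up the tree.
    """
    if not grads:
        return [], 0

    level = [list(g) for g in grads]  # copy so the inputs stay untouched
    rounds_up = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                # Two neighbouring nodes combine their partial sums.
                next_level.append(add_vectors(level[i], level[i + 1]))
            else:
                # An unpaired node passes its partial sum up unchanged.
                next_level.append(level[i])
        level = next_level
        rounds_up += 1

    total = level[0]
    # Going back down the tree, every worker receives the same sum.
    return [list(total) for _ in grads], rounds_up


if __name__ == "__main__":
    # Four simulated workers, each holding the gradient for its batch subset.
    worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    summed, rounds_up = tree_allreduce(worker_grads)
    print(summed[0], rounds_up)
```

In this sketch the number of summing rounds grows with the logarithm of the worker count, whereas a single parameter server would have to receive a gradient from every worker.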

Paper Structure

This paper contains 19 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Data parallel DNN training in FireCaffe: Each worker (GPU) gets a subset of each batch.
  • Figure 2: Deep neural network architectures with more parameters do not necessarily deliver higher accuracy.
  • Figure 3: Illustrating how parameter servers and reduction trees communicate weight gradients. In this figure, we only show the summing-up of weight gradients. We distribute the weight gradient sums by going back down the tree.
  • Figure 4: Comparing communication overhead with a parameter server vs. a reduction tree. This is for the Network-in-Network DNN architecture, so each GPU worker contributes 30MB of gradient updates.
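
The comparison in Figure 4 can be approximated with a simple back-of-the-envelope cost model. The sketch below is an assumption-laden illustration, not the paper's measurement: it assumes a single parameter server whose link serializes one gradient transfer per worker, a binary reduction tree that is traversed once up and once down, and a hypothetical link bandwidth of 1000 MB/s. Only the 30 MB gradient size per worker comes from the Figure 4 caption; the function names are made up for this example.

```python
import math


def parameter_server_time(grad_mb: float, workers: int, bandwidth_mb_s: float) -> float:
    """Seconds to collect gradients at one server (assumed model).

    Assumes the server's link handles the workers' transfers one after
    another, so the cost grows linearly with the number of workers.
    """
    if workers <= 1:
        return 0.0
    return grad_mb * workers / bandwidth_mb_s


def reduction_tree_time(grad_mb: float, workers: int, bandwidth_mb_s: float) -> float:
    """Seconds to sum up and redistribute gradients over a binary tree (assumed model).

    Assumes ceil(log2(workers)) hops up the tree and the same number
    back down, each hop moving one full gradient.
    """
    if workers <= 1:
        return 0.0
    hops = 2 * math.ceil(math.log2(workers))
    return grad_mb * hops / bandwidth_mb_s


if __name__ == "__main__":
    GRAD_MB = 30.0           # per-worker gradient size for NiN (Figure 4 caption)
    BANDWIDTH_MB_S = 1000.0  # hypothetical link bandwidth, chosen for illustration

    for p in (2, 8, 32, 128):
        ps = parameter_server_time(GRAD_MB, p, BANDWIDTH_MB_S)
        tree = reduction_tree_time(GRAD_MB, p, BANDWIDTH_MB_S)
        print(f"{p:4d} workers: parameter server {ps:.2f} s, reduction tree {tree:.2f} s")
```

Under these assumptions the parameter-server cost grows linearly with the number of workers while the reduction-tree cost grows logarithmically, which is the qualitative trend the figure caption describes.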