Parallel training of DNNs with Natural Gradient and Parameter Averaging
Daniel Povey, Xiaohui Zhang, Sanjeev Khudanpur
TL;DR
The paper tackles scalable training of deep neural networks (DNNs) for speech recognition by combining periodic parameter averaging across multiple machines with an efficient natural-gradient variant of stochastic gradient descent (NG-SGD). By approximating the inverse Fisher information matrix with a Kronecker-factorized, low-rank representation, the method preserves effective learning directions while remaining computationally tractable. Empirical results on Fisher English show improved convergence and a robust word error rate (WER) across varying numbers of parallel jobs, with linear speedups up to a modest number of jobs and diminishing gains beyond that. The work provides a practical framework for large-scale, hardware-agnostic neural-network training and highlights the stability and efficiency benefits of natural-gradient-based updates.
Abstract
We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.
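The averaging scheme described above can be illustrated with a small sketch. This is not the Kaldi implementation: it simulates the workers sequentially in one process, uses plain SGD rather than NG-SGD, fits a toy linear model instead of a DNN, and averages after a fixed number of steps rather than after a minute or two of wall-clock time. All function names and hyperparameters here are hypothetical, chosen only for illustration.

```python
import random

def sgd_steps(params, shard, lr, num_steps, rng):
    """Run plain SGD on one worker's shard for a toy model y = w*x + b."""
    w, b = params
    for _ in range(num_steps):
        x, y = rng.choice(shard)
        err = (w * x + b) - y      # derivative of 0.5*err^2 w.r.t. the prediction
        w -= lr * err * x
        b -= lr * err
    return [w, b]

def average_params(param_sets):
    """Element-wise arithmetic mean of the workers' parameter vectors."""
    n = len(param_sets)
    return [sum(vals) / n for vals in zip(*param_sets)]

def train_with_periodic_averaging(shards, num_rounds=50, steps_per_round=100,
                                  lr=0.05, seed=0):
    """Alternate local training on each shard with parameter averaging."""
    rng = random.Random(seed)
    params = [0.0, 0.0]
    for _ in range(num_rounds):
        # Every worker starts from the same averaged parameters
        # and sees only its own shard of the data.
        local = [sgd_steps(list(params), shard, lr, steps_per_round, rng)
                 for shard in shards]
        # Average, then redistribute by reusing `params` in the next round.
        params = average_params(local)
    return params

def make_shards(num_workers=4, per_worker=200, seed=1):
    """Toy data: each worker gets its own noiseless samples of y = 2x - 1."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        shard = []
        for _ in range(per_worker):
            x = rng.uniform(-1.0, 1.0)
            shard.append((x, 2.0 * x - 1.0))
        shards.append(shard)
    return shards

final_w, final_b = train_with_periodic_averaging(make_shards())
```

On this convex toy problem, averaging with plain SGD recovers the target parameters without difficulty, so the sketch shows only the communication pattern. It does not reproduce the abstract's observation that averaging alone works poorly for DNNs unless combined with NG-SGD.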
