Parallel training of DNNs with Natural Gradient and Parameter Averaging

Daniel Povey, Xiaohui Zhang, Sanjeev Khudanpur

TL;DR

The paper tackles scalable training of deep neural networks (DNNs) for speech recognition by combining periodic parameter averaging across multiple machines with an efficient natural-gradient variant of stochastic gradient descent (NG-SGD). The method approximates the inverse Fisher information matrix with a Kronecker-factored, low-rank representation, which preserves effective learning directions while keeping the computation tractable. Experiments on Fisher English show improved convergence and a stable word error rate (WER) as the number of parallel jobs varies; the speedup is linear up to a modest number of jobs and shows diminishing gains beyond that. The work provides a practical framework for large-scale, hardware-agnostic neural-network training and highlights the stability and efficiency benefits of natural-gradient updates.
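To make the tractability point concrete, here is a minimal NumPy sketch of why a Kronecker-factored approximation is cheap to invert. It is not the paper's estimation procedure: the factors `A` and `B`, the helper `spd_identity_plus_low_rank`, and the layer sizes are hypothetical, and the factors are filled with random values rather than estimated from training data. The sketch only checks the standard identity that solving against `kron(A, B)` is equivalent to two small solves on the gradient matrix.

```python
import numpy as np

def spd_identity_plus_low_rank(dim, rank, rng):
    """Hypothetical stand-in for a Fisher factor: identity plus a low-rank term."""
    R = rng.normal(size=(dim, rank))
    return np.eye(dim) + R @ R.T / dim

rng = np.random.default_rng(0)
n_out, n_in = 4, 6                              # toy layer sizes (assumed)
A = spd_identity_plus_low_rank(n_in, 2, rng)    # input-side factor
B = spd_identity_plus_low_rank(n_out, 2, rng)   # output-side factor
G = rng.normal(size=(n_out, n_in))              # gradient w.r.t. a weight matrix

# Cheap route: two small solves give B^{-1} G A^{-1} (A is symmetric).
precond = np.linalg.solve(A, np.linalg.solve(B, G).T).T

# Expensive reference: build the full (n_in*n_out) x (n_in*n_out) matrix
# and solve against the column-major vectorization of G.
F = np.kron(A, B)
ref = np.linalg.solve(F, G.flatten(order="F")).reshape((n_out, n_in), order="F")

print(np.allclose(precond, ref))
```

The cheap route never forms the full matrix, so its cost depends on the layer's input and output dimensions separately rather than on their product.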

Abstract

We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.
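The periodic-averaging scheme described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not Kaldi's implementation: the function names (`sgd_steps`, `train_with_periodic_averaging`), the least-squares objective, and all hyperparameters are hypothetical, the "machines" are simulated sequentially in one process, and plain SGD is used in place of NG-SGD. Because the toy objective is convex, averaging behaves well here; the abstract notes that for DNNs averaging on its own does not work very well.

```python
import numpy as np

def sgd_steps(w, X, y, lr, num_steps, batch_size, rng):
    """Plain minibatch SGD on a least-squares loss (toy stand-in for DNN training)."""
    w = w.copy()
    for _ in range(num_steps):
        idx = rng.integers(0, len(X), size=batch_size)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w -= lr * grad
    return w

def train_with_periodic_averaging(X, y, num_jobs=4, num_rounds=20,
                                  steps_per_round=50, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Each simulated job sees a different shard of the data.
    shards = np.array_split(rng.permutation(len(X)), num_jobs)
    w = np.zeros(X.shape[1])
    for _ in range(num_rounds):
        # Every job starts from the same averaged parameters and trains independently.
        local = [sgd_steps(w, X[s], y[s], lr, steps_per_round, 16, rng)
                 for s in shards]
        # Average the parameters and redistribute them for the next round.
        w = np.mean(local, axis=0)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
y = X @ true_w + 0.01 * rng.normal(size=2000)
w = train_with_periodic_averaging(X, y)
print(np.round(w, 2))
```

Only the parameter vector crosses job boundaries, once per round, which is the property that keeps network traffic low regardless of how many gradient steps each job takes in between.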

Paper Structure

This paper contains 51 sections, 66 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Convergence of training objective function (log-probability)