Adaptive Dropout for Pruning Conformers

Yotaro Kubo, Xingyu Cai, Michiel Bacchiani

TL;DR

The paper tackles overparameterization in Conformer-based speech encoders by integrating trainable adaptive dropout layers (ADLs) whose unit-wise retention probabilities are learned through a Gumbel-Sigmoid reparameterization ($m_d \sim \text{Bern}(\varsigma(\beta_d))$). It introduces an uncentered L2 regularization with a decreasing target $c(t)$ that prunes units gradually during training, avoiding premature hard decisions. ADLs are embedded at three locations in each Conformer block (FFN, MHSA, and LConv) so that a large fraction of parameters can be pruned while preserving or improving accuracy on LibriSpeech. Empirical results show about a 54% parameter reduction together with a WER improvement of roughly 1%, demonstrating a practical route to compact, efficient speech encoders without post-pruning fine-tuning.

Abstract

This paper proposes a method to effectively perform joint training-and-pruning based on adaptive dropout layers with unit-wise retention probabilities. The proposed method is based on the estimation of a unit-wise retention probability in a dropout layer. A unit that is estimated to have a small retention probability can be considered prunable. The retention probability of each unit is estimated using back-propagation and the Gumbel-Softmax technique. This pruning method is applied at several points in Conformers so that the effective number of parameters can be significantly reduced. Specifically, adaptive dropout layers are introduced in three locations in each Conformer block: (a) the hidden layer of the feed-forward-net component, (b) the query vectors and the value vectors of the self-attention component, and (c) the input vectors of the LConv component. The proposed method is evaluated in a speech recognition experiment on the LibriSpeech task. The results show that this approach can simultaneously achieve parameter reduction and accuracy improvement: the word error rates improved by approximately 1% while the number of parameters was reduced by 54%.
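The retention-probability idea in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it draws a relaxed Bernoulli (Gumbel-Sigmoid) mask value per unit from a logit $\beta_d$, so that the sample stays differentiable with respect to $\beta_d$, and it flags units whose retention probability $\varsigma(\beta_d)$ is small. The function names, the temperature `tau`, and the pruning `threshold` are assumptions made for this example and do not come from the paper.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_relaxed_mask(beta, tau=0.5, rng=random):
    """Draw one relaxed Bernoulli mask value per unit.

    beta: per-unit logits; sigmoid(beta[d]) is the retention probability.
    tau:  relaxation temperature (hypothetical default, not from the paper).
    """
    mask = []
    for b in beta:
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
        # Logistic noise equals the difference of two Gumbel samples.
        logistic_noise = math.log(u) - math.log(1.0 - u)
        mask.append(sigmoid((b + logistic_noise) / tau))
    return mask


def prunable_units(beta, threshold=0.05):
    """Indices whose retention probability is below a hypothetical threshold."""
    return [d for d, b in enumerate(beta) if sigmoid(b) < threshold]


beta = [4.0, 0.0, -6.0]  # illustrative logits for three units
mask = sample_relaxed_mask(beta, rng=random.Random(0))
pruned = prunable_units(beta)
```

In a real training setup the mask would multiply the unit activations and the logits would be updated by back-propagation; an automatic-differentiation framework would replace the plain Python lists used here.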

Paper Structure

This paper contains 8 sections, 12 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: The locations of the masking components inserted in each submodule of the Conformer. The parametrized components in a Conformer block are shown as boxes with the variable names in them (the bias parameters are omitted for simplicity). The blocks with gray background represent dimensionality-wise computations. The effects of masking propagate beyond those gray blocks and make it possible to prune the parameters adjacent to them.
  • Figure 2: The ratio of surviving units for each different place where ADLs were inserted.
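
The Figure 1 caption notes that masking a unit makes the adjacent parameters prunable. A minimal sketch of that effect for a two-layer feed-forward net is given below: once a hidden unit is masked to zero, the matching row of the input projection and column of the output projection can be deleted without changing the output. The helper names `ffn` and `prune_ffn` are hypothetical, biases are omitted as in the figure, and ReLU is used only as a stand-in activation for illustration.

```python
def ffn(w_in, w_out, x, mask=None):
    """Compute w_out @ (mask * relu(w_in @ x)) with plain lists."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if mask is not None:
        hidden = [h * m for h, m in zip(hidden, mask)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def prune_ffn(w_in, w_out, keep):
    """Drop masked hidden units from both projections.

    w_in:  hidden x input matrix (one row per hidden unit).
    w_out: output x hidden matrix (one column per hidden unit).
    keep:  booleans, True for hidden units that survive.
    """
    kept = [d for d, k in enumerate(keep) if k]
    w_in_pruned = [w_in[d] for d in kept]
    w_out_pruned = [[row[d] for d in kept] for row in w_out]
    return w_in_pruned, w_out_pruned


w_in = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 hidden units, 2 inputs
w_out = [[1.0, 1.0, 1.0], [2.0, 0.0, -1.0]]   # 2 outputs, 3 hidden units
x = [1.0, 0.5]
keep = [True, False, True]                    # hidden unit 1 is masked out

masked_out = ffn(w_in, w_out, x, mask=[1.0 if k else 0.0 for k in keep])
w_in_p, w_out_p = prune_ffn(w_in, w_out, keep)
pruned_out = ffn(w_in_p, w_out_p, x)
```

Here `masked_out` and `pruned_out` agree, while the pruned network stores two hidden units instead of three.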