
The Conformer Encoder May Reverse the Time Dimension

Robin Schmitt, Albert Zeyer, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney

TL;DR

The paper investigates an unexpected time-dimension reversal in Conformer-based encoder–decoder models used for automatic speech recognition. It analyzes how cross-attention and late-stage self-attention dynamics can flip the encoder's time ordering, and it demonstrates that this behavior emerges in Conformer block 10 due to dominant self-attention and weakened residual influence. The authors propose practical mitigation strategies, including employing a CTC auxiliary loss, delaying self-attention, or forcing center-frame attention, which stabilize training and prevent flipping. Additionally, they introduce a gradient-based method to derive label–frame alignments from the encoder inputs, showing competitive time-stamp accuracy and robust alignments even when flipping occurs. The work offers actionable guidance for training robust Conformer AED models and provides a novel tool for alignment extraction in ASR.

Abstract

We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models. Further investigation shows that the Conformer encoder reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose methods and ideas for how this flipping can be avoided and investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.
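
The symptom described in the abstract, cross-attention positions that decrease as decoding proceeds, can be checked with a simple heuristic. The sketch below is an illustration only and is not taken from the paper: the function names `attention_centers` and `looks_flipped`, the center-of-mass criterion, and the toy weight matrices are all assumptions.

```python
from typing import List, Sequence


def attention_centers(weights: Sequence[Sequence[float]]) -> List[float]:
    """Expected encoder frame index (center of mass) for each label's attention row.

    weights[s][t] is the cross-attention weight of output label s on encoder frame t.
    """
    centers = []
    for row in weights:
        total = sum(row)
        centers.append(sum(t * w for t, w in enumerate(row)) / total)
    return centers


def looks_flipped(weights: Sequence[Sequence[float]]) -> bool:
    """Heuristic (assumption): flipped if centers move backwards more often than forwards."""
    centers = attention_centers(weights)
    steps = [b - a for a, b in zip(centers, centers[1:])]
    decreasing = sum(1 for d in steps if d < 0)
    increasing = sum(1 for d in steps if d > 0)
    return decreasing > increasing


# Toy example: 3 labels attending over 6 encoder frames.
standard = [
    [0.8, 0.2, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.2, 0.7],
]
# A time-reversed encoder output mirrors every attention row.
flipped = [row[::-1] for row in standard]

print(looks_flipped(standard), looks_flipped(flipped))
```

In practice the weights would come from one cross-attention head (or an average over heads) of a trained model; the threshold logic here is deliberately crude and would likely need smoothing for noisy attention maps.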

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Cross-attention weights of a model with reversed encoder vs. standard encoder.
  • Figure 2: Development of cross-attention weights into the flipping behavior over the initial training epochs.
  • Figure 3: $G_0$, i.e. the logarithm of the $L_2$ norm of the gradients of the target label log probabilities w.r.t. the inputs of the first Conformer block, very early in training (after 2 epochs).
  • Figure 4: Self-attention energies averaged over the 8 heads of the 10th Conformer block for initial epochs. After this, all further layers are flipped.
  • Figure 5: Gradients $G_9$ and $G_{10}$ w.r.t. the output of blocks 9 and 10 after 12 epochs. For $G_9$, we have the crossing of information from the residual and the self-attention. In $G_{10}$, only the flipped information is left.
  • ...and 1 more figures
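
The quantity $G_0$ in the Figure 3 caption, the logarithm of the $L_2$ norm of per-label gradients with respect to the input frames, can be sketched as follows. This is a minimal illustration under stated assumptions: the gradients are taken as already computed (for example by an autodiff framework), the names `log_grad_norms` and `argmax_alignment` are hypothetical, and picking the argmax frame per label is a simplification that may differ from the alignment procedure used in the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def log_grad_norms(grads: Sequence[Sequence[Vector]], eps: float = 1e-12) -> List[List[float]]:
    """Return G[s][t] = log(||grads[s][t]||_2 + eps).

    grads[s][t] is the gradient of the log probability of label s
    w.r.t. the feature vector of input frame t (assumed precomputed).
    eps only guards against log(0) for all-zero gradients.
    """
    return [
        [math.log(math.sqrt(sum(g * g for g in frame)) + eps) for frame in label]
        for label in grads
    ]


def argmax_alignment(G: Sequence[Sequence[float]]) -> List[int]:
    """Simplified alignment (assumption): the frame with the largest score per label."""
    return [max(range(len(row)), key=row.__getitem__) for row in G]


# Toy gradients: 2 labels, 3 frames, 2 feature dimensions per frame.
grads = [
    [[3.0, 4.0], [0.1, 0.0], [0.0, 0.0]],
    [[0.0, 0.1], [0.2, 0.0], [6.0, 8.0]],
]
G = log_grad_norms(grads)
alignment = argmax_alignment(G)
print(alignment)
```

Because the gradients are taken with respect to the encoder input rather than its output, the resulting scores refer to the original time axis, which is consistent with the summary's claim that alignments remain usable even when the encoder flips the sequence internally.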