The Conformer Encoder May Reverse the Time Dimension
Robin Schmitt, Albert Zeyer, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
TL;DR
The paper investigates an unexpected time-dimension reversal in Conformer-based encoder–decoder models used for automatic speech recognition. It analyzes how cross-attention and late-stage self-attention dynamics can flip the encoder's time ordering, and it demonstrates that this behavior emerges in Conformer block 10 due to dominant self-attention and weakened residual influence. The authors propose practical mitigation strategies, including employing a CTC auxiliary loss, delaying self-attention, or forcing center-frame attention, which stabilize training and prevent flipping. Additionally, they introduce a gradient-based method to derive label–frame alignments from the encoder inputs, showing competitive time-stamp accuracy and robust alignments even when flipping occurs. The work offers actionable guidance for training robust Conformer AED models and provides a novel tool for alignment extraction in ASR.
Abstract
We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models. Further investigation shows that the Conformer encoder reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose methods and ideas for avoiding this flipping and investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.
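To make the gradient-based alignment idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The model `toy_label_log_probs` is a hypothetical stand-in whose "encoder" reverses the time axis on purpose. Gradients are approximated by central finite differences, and each label is assigned the input frame with the largest gradient norm; this per-label argmax is a simplifying assumption and may differ from the procedure used in the paper.

```python
import math


def toy_label_log_probs(frames, num_labels):
    """Hypothetical stand-in for an AED model: one log probability per label.

    The toy "encoder" reverses the time axis. Label s reads a soft window of
    the encoder output centered at position T - 1 - 2*s, so the input frame
    it truly depends on most is frame 2*s.
    """
    T = len(frames)
    encoded = frames[::-1]  # time-reversed encoder output
    log_probs = []
    for s in range(num_labels):
        center = T - 1 - 2 * s
        score = sum(
            math.exp(-((t - center) ** 2)) * sum(encoded[t]) for t in range(T)
        )
        log_probs.append(-math.log1p(math.exp(-score)))  # log(sigmoid(score))
    return log_probs


def gradient_alignment(frames, num_labels, eps=1e-4):
    """For each label, return the input frame with the largest gradient norm.

    The gradient of each label log probability w.r.t. every input feature is
    estimated with central finite differences (an assumption made to keep the
    sketch dependency-free).
    """
    T, D = len(frames), len(frames[0])
    sq_norm = [[0.0] * T for _ in range(num_labels)]
    for t in range(T):
        for d in range(D):
            plus = [row[:] for row in frames]
            minus = [row[:] for row in frames]
            plus[t][d] += eps
            minus[t][d] -= eps
            lp_plus = toy_label_log_probs(plus, num_labels)
            lp_minus = toy_label_log_probs(minus, num_labels)
            for s in range(num_labels):
                grad = (lp_plus[s] - lp_minus[s]) / (2 * eps)
                sq_norm[s][t] += grad * grad
    return [max(range(T), key=lambda t: sq_norm[s][t]) for s in range(num_labels)]


# Six input frames with two features each (illustrative values only).
frames = [[0.1 * (t + 1), 0.05 * t] for t in range(6)]
alignment = gradient_alignment(frames, num_labels=3)
print(alignment)  # [0, 2, 4]
```

In this toy setup the labels read encoder positions 5, 3, 1, which would appear as monotonically decreasing cross-attention, while the gradients w.r.t. the input frames still recover the increasing positions 0, 2, 4. With a real model, an automatic differentiation framework would presumably replace the finite-difference loop.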
