Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

Adam Stooke, Rohit Prabhavalkar, Khe Chai Sim, Pedro Moreno Mengibar

TL;DR

This work shows that transformer-based encoders can implicitly learn audio-to-text alignment during forward computation, enabling an Aligner-Encoder that pairs a lightweight autoregressive decoder with a one-to-one encoder–decoder mapping. Training uses a frame-wise cross-entropy loss without dynamic programming or full cross-attention, dramatically reducing decoding complexity and improving efficiency. Empirical results across LibriSpeech, Voice Search, and YouTube long-form data demonstrate competitive performance close to RNN-T and often surpassing AED or CTC, with notable gains in inference speed and scalability for long sequences. Alignment behavior is visualized through self-attention and embedding analyses, revealing that the encoder can effectively self-transduce and even handle reverse alignments, suggesting potential extensions to non-monotonic tasks like machine translation or speech translation.

Abstract

Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention -- it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".
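
The decoding procedure described in the abstract (scan the embedding frames in order, emit one token per frame, stop at end-of-message) can be illustrated with a minimal greedy sketch. This is not the authors' implementation. The names `greedy_aligner_decode`, `step_fn`, `toy_step`, `EOS_ID`, and `BOS_ID` are hypothetical, and the toy scoring function stands in for a trained decoder, which would combine each frame with a text-only recurrence over the previously emitted tokens.

```python
from typing import Callable, List, Sequence

EOS_ID = 0   # assumed id for the end-of-message token (illustrative)
BOS_ID = -1  # assumed placeholder for "no previous token" (illustrative)


def greedy_aligner_decode(
    frames: Sequence[Sequence[float]],
    step_fn: Callable[[Sequence[float], int], Sequence[float]],
    eos_id: int = EOS_ID,
    bos_id: int = BOS_ID,
) -> List[int]:
    """Greedy decoding with a one-to-one frame-to-token mapping.

    Frame i is consumed to predict token i. No cross-attention and no
    alignment lattice are used: the encoder is assumed to have already
    moved the information for token i into frame i.
    """
    tokens: List[int] = []
    prev = bos_id
    for frame in frames:
        # Score every vocabulary entry from the current frame and the
        # previously emitted token.
        scores = step_fn(frame, prev)
        best = max(range(len(scores)), key=lambda k: scores[k])
        if best == eos_id:
            break  # end-of-message: remaining frames are never visited
        tokens.append(best)
        prev = best
    return tokens


def toy_step(frame: Sequence[float], prev_token: int) -> Sequence[float]:
    """Toy stand-in for a trained decoder: the frame already holds the scores."""
    return frame


# Four "embedding frames" over a 3-token vocabulary (id 0 is end-of-message).
toy_frames = [
    [0.1, 0.9, 0.0],
    [0.2, 0.1, 0.7],
    [0.9, 0.05, 0.05],
    [0.0, 1.0, 0.0],
]
print(greedy_aligner_decode(toy_frames, toy_step))  # prints [1, 2]
```

In this sketch the loop runs for at most one step per emitted token plus one for end-of-message, which is in keeping with the inference speed advantage the abstract reports relative to RNN-T and AED.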

Paper Structure

This paper contains 21 sections, 3 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Information flow through an Aligner-Encoder versus traditional audio encoders.
  • Figure 2: Self-attention probabilities from a single head at different layers in a 17-layer Aligner-Encoder performing audio-to-text alignment.
  • Figure 3: Decoding lattice probabilities ($U$ vs $T$) from RNN-T-on-Aligner and self-attention weights exhibiting successful alignment, within a specific layer.
  • Figure 4: Decoding lattice probabilities ($U$ vs $T$) from RNN-T-on-Aligner and self-attention weights exhibiting a failure mode for an utterance 1.5x longer than trained.
  • Figure 5: Self-attention weights in an Aligner-Encoder (from a single head) trained on reversed audio; the reverse alignment is clearly visible in layers 15 and 16. (LibriSpeech, 17-layer encoder).