An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement

Qiquan Zhang, Meng Ge, Hongxu Zhu, Eliathamby Ambikairajah, Qi Song, Zhaoheng Ni, Haizhou Li

TL;DR

The paper asks how positional encoding influences Transformer-based monaural speech enhancement. It empirically compares five encoding schemes (Sinusoidal-APE, Learnable-APE, T5-RPE, KERPLE, and No-Pos) across causal and noncausal Transformers for spectral mapping (MS) and spectral masking (PSM). The key findings are that positional encoding provides limited benefit in causal configurations but yields substantial improvements in noncausal settings, with relative position embeddings outperforming absolute ones. The work offers practical guidance for designing Transformer-based speech enhancement systems and clarifies when positional encoding helps and which schemes are advantageous.

Abstract

The Transformer architecture has enabled recent progress in speech enhancement. Since Transformers are position-agnostic, positional encoding is the de facto standard component used to enable Transformers to distinguish the order of elements in a sequence. However, it remains unclear exactly how positional encoding impacts speech enhancement based on Transformer architectures. In this paper, we perform a comprehensive empirical study evaluating five positional encoding methods, i.e., Sinusoidal and learned absolute position embedding (APE), T5-RPE, KERPLE, as well as the Transformer without positional encoding (No-Pos), across both causal and noncausal configurations. We conduct extensive speech enhancement experiments involving spectral mapping and masking methods. Our findings establish that positional encoding is of little help to the models in a causal configuration, which indicates that causal attention may implicitly incorporate position information. In a noncausal configuration, the models benefit significantly from the use of positional encoding. In addition, we find that among the four position embeddings, relative position embeddings outperform APEs.
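
As a rough illustration of the simplest scheme named in the abstract, the sketch below builds a sinusoidal absolute position embedding and adds it element-wise to a sequence of frame features. It follows the widely used sine/cosine formulation with base 10000; the paper's exact configuration is not given in this summary, so the function name `sinusoidal_ape`, the sizes (12 frames, 8 dimensions), and the random `frames` array are assumptions made only for the example.

```python
import numpy as np

def sinusoidal_ape(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) table of sinusoidal position embeddings.

    Hypothetical helper for illustration; assumes an even d_model.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    table = np.zeros((seq_len, d_model))
    table[:, 0::2] = np.sin(angles)                # even dimensions: sine
    table[:, 1::2] = np.cos(angles)                # odd dimensions: cosine
    return table

# Placeholder frame features (random values, not real speech data).
frames = np.random.default_rng(0).standard_normal((12, 8))
pe = sinusoidal_ape(12, 8)

# Element-wise summation makes each frame representation position-aware.
position_aware = frames + pe
```

Because the table depends only on the frame index, the same frame content receives a different representation at different positions. The No-Pos variant simply skips the summation.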

Paper Structure (14 sections, 7 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Illustration of position-aware Transformer architecture for speech enhancement. $\oplus$ denotes the element-wise summation.
  • Figure 2: Illustration of (a) the causal self-attention and (b) the noncausal (full) self-attention (with sequence length $L=12$).
  • Figure 3: The training and validation loss in causal configuration.
  • Figure 4: The training and validation loss in noncausal configuration.
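
The distinction drawn in the Figure 2 caption can be sketched in a few lines. The toy example below is not the paper's model: it uses a single attention head with identity projections and no positional encoding, and the names `self_attention`, `x`, and `perm` are assumptions for illustration. It checks whether reversing the input frames simply reverses the output. That holds for noncausal (full) attention, which therefore cannot tell frame order apart on its own, but not for causal attention, whose mask already makes the output depend on position.

```python
import numpy as np

def self_attention(x: np.ndarray, causal: bool) -> np.ndarray:
    """Toy single-head self-attention with identity projections and no positional encoding."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Each frame may attend only to itself and to earlier frames.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

L, d = 12, 4                                   # L = 12 as in the Figure 2 caption
x = np.random.default_rng(0).standard_normal((L, d))
perm = np.arange(L)[::-1]                      # reverse the frame order

causal_mask = np.tril(np.ones((L, L), dtype=bool))

# Does permuting the input merely permute the output in the same way?
full_equivariant = np.allclose(
    self_attention(x[perm], causal=False), self_attention(x, causal=False)[perm]
)
causal_equivariant = np.allclose(
    self_attention(x[perm], causal=True), self_attention(x, causal=True)[perm]
)
print(full_equivariant, causal_equivariant)
```

This toy check is consistent with the abstract's observation that causal attention may implicitly incorporate position information, while the noncausal configuration has to rely on an explicit positional encoding.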