EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

Jianzong Wang, Ziqi Liang, Xulong Zhang, Ning Cheng, Jing Xiao

TL;DR

EfficientASR reduces the compute and storage cost of Transformer-based ASR with two modules: SRMHA, which shares attention scores across layers and updates the attention with SWD to remove redundant computation, and CFFN, which compresses the FFN parameters. It combines cross-layer residual fusion of attention scores, diagonalized attention, and chunked FFNs. On Aishell-1 and HKUST, the model uses 36% fewer parameters than the baseline Transformer while slightly improving CER, and Conformer_EfficientASR variants are evaluated as well. The approach also lowers memory usage on long sequences, which makes it a candidate for deployment on resource-limited devices.

Abstract

In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN). The SRMHA module effectively reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters. The effectiveness of the EfficientASR model is validated on two public datasets, namely Aishell-1 and HKUST. Experimental results demonstrate a 36% reduction in parameters compared to the baseline Transformer network, along with improvements of 0.3% and 0.2% in Character Error Rate (CER) on the Aishell-1 and HKUST datasets, respectively.
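
To make the parameter argument concrete, the sketch below counts the weights of a standard position-wise FFN and of a chunked variant. It is an illustration under stated assumptions, not the paper's CFFN: the scheme (split the embedding into equal chunks and run a smaller FFN on each chunk, optionally sharing one FFN across chunks) and the sizes (`d_model=256`, `d_ff=2048`, 4 chunks) are hypothetical choices made here for the example.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of a two-layer position-wise FFN (d_model -> d_ff -> d_model)."""
    first_layer = d_model * d_ff + d_ff
    second_layer = d_ff * d_model + d_model
    return first_layer + second_layer


def chunked_ffn_params(d_model: int, d_ff: int, n_chunks: int, shared: bool = True) -> int:
    """Parameter count for a hypothetical chunked FFN.

    Assumption: the embedding is split into n_chunks equal chunks and each
    chunk is processed by an FFN whose widths are scaled down by n_chunks.
    With shared=True, a single small FFN is reused for every chunk.
    """
    if d_model % n_chunks or d_ff % n_chunks:
        raise ValueError("d_model and d_ff must be divisible by n_chunks")
    per_chunk = ffn_params(d_model // n_chunks, d_ff // n_chunks)
    return per_chunk if shared else per_chunk * n_chunks


if __name__ == "__main__":
    dense = ffn_params(256, 2048)
    for shared in (False, True):
        chunked = chunked_ffn_params(256, 2048, n_chunks=4, shared=shared)
        print(f"shared={shared}: {chunked} vs {dense} parameters "
              f"({1 - chunked / dense:.0%} fewer)")
```

Under these assumptions, the saving comes from the weight matrices shrinking quadratically with the chunk width while the number of chunks grows only linearly. The 36% figure reported in the abstract refers to the whole network and cannot be derived from this toy count.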

Paper Structure (19 sections, 12 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: (a) The traditional Transformer structure. (b) The EfficientASR structure, where the dashed portion is used only in shared attention mode. The small circles represent the embedding dimensions of the input features, and the small squares represent learnable parameters. SWD stands for sliding window with deformability.
  • Figure 2: Memory usage of different models when processing long sequences on the HKUST dataset.
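
As background for why long sequences put pressure on memory, the sketch below estimates the size of the self-attention score matrix for a single layer, which grows quadratically with the number of frames. The head count, frame counts, float32 storage, and batch size of 1 are assumptions chosen for illustration; none of them come from the paper.

```python
def attention_score_bytes(seq_len: int, n_heads: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold one layer's attention scores for a single utterance."""
    # One (seq_len x seq_len) score matrix per head.
    return n_heads * seq_len * seq_len * bytes_per_value


if __name__ == "__main__":
    for frames in (500, 1000, 2000):
        mib = attention_score_bytes(frames, n_heads=4) / 2**20
        print(f"{frames} frames: {mib:.1f} MiB per layer")
```

Doubling the sequence length quadruples this term. This is only the generic quadratic cost of full self-attention and does not model the memory behaviour of EfficientASR or of the baselines in Figure 2.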