Speech Recognition Transformers: Topological-lingualism Perspective

Shruti Singh, Muskaan Singh, Virender Kadyan

TL;DR

This survey analyzes transformer-based ASR through a lingualism lens, addressing monolingual, bilingual, multilingual, and cross-lingual setups. It synthesizes background, datasets, acoustic features, architectures, decoding strategies, evaluation metrics, and popular toolkits, linking them to practical challenges like data scarcity and computational demands. Key contributions include a structured taxonomy, comprehensive cataloging of datasets and languages, and a critical discussion of open challenges and directions to advance ASR for low-resource languages. The work highlights the potential of transfer learning and self-supervised representations to extend high-resource language advances to diverse linguistic contexts, with implications for accessibility and real-world deployment. Overall, the paper provides a detailed roadmap for researchers and practitioners seeking to build scalable, robust, and multilingual ASR systems using transformer-based architectures.

Abstract

Transformers have achieved great success in various artificial intelligence tasks. Thanks to the recent prevalence of self-attention mechanisms, which capture long-term dependencies, they have produced phenomenal outcomes in speech processing and recognition tasks. This paper presents a comprehensive survey of transformer techniques for the speech modality. The main contents of this survey include (1) background on traditional ASR, the end-to-end transformer ecosystem, and speech transformers; (2) foundational speech models viewed through the lingualism paradigm, i.e., monolingual, bilingual, multilingual, and cross-lingual; (3) datasets and languages, acoustic features, architectures, decoding, and evaluation metrics from a topological-lingualism perspective; and (4) popular speech transformer toolkits for building end-to-end ASR systems. Finally, we discuss open challenges and potential research directions for the community to conduct further research in this domain.

Paper Structure

This paper contains 32 sections, 3 equations, 12 figures, and 10 tables.

Figures (12)

  • Figure 1: (a) Audio/speech processing encoder-decoder architecture (latif2023transformers, vaswani2017attention), with stacked self-attention and feed-forward sublayers. (b) Multi-head attention, consisting of parallel attention layers executed simultaneously. (c) Scaled dot-product attention.
  • Figure 2: Audio processing taxonomy based on lingualism
  • Figure 3: Analysis of the datasets used for building and training monolingual ASR systems
  • Figure 4: Analysis of the languages used for building and training monolingual ASR systems. Commonly explored languages in the development of monolingual ASR are English, French, Norwegian, Czech, Bengali, Finnish, and Mandarin
  • Figure 5: Analysis of the architectures commonly used in monolingual ASR
  • ...and 7 more figures