Automatic Speech Recognition with BERT and CTC Transformers: A Review
Noussaiba Djeffal, Hamza Kheddar, Djamel Addou, Ahmed Cherif Mazari, Yassine Himeur
TL;DR
This survey analyzes the use of BERT-based and CTC-based transformers for automatic speech recognition, identifying core ASR challenges and how transformer architectures address them. It covers a broad spectrum of models and variants, including HuBERT, Speech-BERT, NorBERT, and related language-model adaptations, as well as diverse CTC-based approaches such as Mask CTC and non-autoregressive CTC, summarizing methodologies, datasets, metrics, and results. The review discusses strengths, limitations, and practical considerations like decoding speed and resource demands, while highlighting opportunities for language coverage and robustness. It also proposes future directions, notably the potential integration of ChatGPT with BERT and CTC frameworks to enhance contextual understanding and generation in ASR systems. Overall, the paper provides a consolidated view of progress and a roadmap for advancing ASR with BERT and CTC transformers.
Abstract
This review paper provides a comprehensive analysis of recent advances in automatic speech recognition (ASR) with bidirectional encoder representations from transformers (BERT) and connectionist temporal classification (CTC) transformers. The paper first introduces the fundamental concepts of ASR and discusses the associated challenges. It then explains the architecture of BERT and CTC transformers and their potential applications in ASR. The paper reviews several studies that have used these models for speech recognition tasks and discusses the results obtained. Additionally, the paper highlights the limitations of these models and outlines potential areas for further research. Overall, this review provides valuable insights for researchers and practitioners interested in ASR with BERT and CTC transformers.
