
Using Large Language Model for End-to-End Chinese ASR and NER

Yuang Li, Jiawei Yu, Min Zhang, Mengxin Ren, Yanqing Zhao, Xiaofeng Zhao, Shimin Tao, Jinsong Su, Hao Yang

TL;DR

This work systematically compares decoder-only and encoder-decoder architectures for integrating speech into decoder-based LLMs, using a Whisper encoder and ChatGLM3 to tackle Chinese ASR and NER. It shows that encoder-decoder with cross-attention excels on short-context tasks, while decoder-only models leverage long-context better, reducing entity omissions and achieving a state-of-the-art AISHELL-NER F1 of 0.805 via a chain-of-thought NER approach. The authors introduce a three-phase training regime (short-form ASR, long-form ASR with history, and CoT NER) and a fine-grained NER error taxonomy to dissect performance, supported by attention/gate visualizations. The findings guide design choices for multimodal LLMs and highlight the potential of combining both architectures to maximize performance across tasks requiring both high-level semantic understanding and detailed acoustic processing.

Abstract

Mapping speech tokens to the same feature space as text tokens has become the paradigm for the integration of the speech modality into decoder-only large language models (LLMs). An alternative approach is to use an encoder-decoder architecture that incorporates speech features through cross-attention. This approach, however, has received less attention in the literature. In this work, we connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons of these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. We evaluate them not only by conventional metrics like the F1 score but also by a novel fine-grained taxonomy of ASR-NER errors. Our experiments reveal that the encoder-decoder architecture outperforms the decoder-only architecture with a short context, while the decoder-only architecture benefits from a long context as it fully exploits all layers of the LLM. By using an LLM, we significantly reduced entity omission errors and improved entity ASR accuracy compared to the Conformer baseline. Additionally, we obtained a state-of-the-art (SOTA) F1 score of 0.805 on the AISHELL-NER test set by using chain-of-thought (CoT) NER, which first infers long-form ASR transcriptions and then predicts NER labels.
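The entity-level F1 score mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each utterance is reduced to a list of (entity text, entity type) pairs and that scores are micro-averaged over utterances; the function name `entity_f1`, the matching rule, and the toy data are illustrative assumptions, not the paper's evaluation script.

```python
from collections import Counter


def entity_f1(references, hypotheses):
    """Micro-averaged entity-level precision, recall, and F1.

    Assumption: each utterance is a list of (entity_text, entity_type)
    pairs, and a predicted entity counts as correct only if both text
    and type match a reference entity in the same utterance.
    """
    tp = fp = fn = 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        # Multiset intersection: repeated entities are matched one-to-one.
        matched = sum((ref_counts & hyp_counts).values())
        tp += matched
        fp += sum(hyp_counts.values()) - matched  # spurious entities
        fn += sum(ref_counts.values()) - matched  # missed entities (e.g. omissions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data: one omitted entity and one spurious entity.
refs = [[("北京", "LOC"), ("张伟", "PER")], [("华为", "ORG")]]
hyps = [[("北京", "LOC")], [("华为", "ORG"), ("深圳", "LOC")]]
precision, recall, f1 = entity_f1(refs, hyps)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case the omitted entity lowers recall and the spurious one lowers precision. A single F1 number hides that distinction, which is the motivation the abstract gives for a finer-grained error taxonomy.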

Paper Structure (16 sections, 2 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: The speech modality is incorporated into the LLM through (a) an adapter (decoder-only), and (b) cross-attention layers (encoder-decoder).
  • Figure 2: (a) Gate values of the cross-attention across different layers for the encoder-decoder architecture during different training phases. (b) Attention scores corresponding to the speech tokens across different layers for the decoder-only architecture with different historical (His.) context lengths and tasks (i.e., ASR or NER).
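The two integration routes named in the Figure 1 caption can be contrasted with a small pure-Python sketch. It is a toy under stated assumptions, not the paper's implementation: attention is single-head with no learned projections, the gate is applied through tanh (a common choice for gated cross-attention; the source only says that gate values exist), and `adapter`, `prepend_speech`, and `gated_cross_attention` are hypothetical names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, speech):
    """Attend from one text state over all speech frames (dot-product scores)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, frame)) / scale for frame in speech]
    weights = softmax(scores)
    dim = len(speech[0])
    return [sum(w * frame[d] for w, frame in zip(weights, speech)) for d in range(dim)]


def prepend_speech(text_embeddings, speech, adapter):
    """(a) Decoder-only: map speech frames into the text space and prepend them."""
    return [adapter(frame) for frame in speech] + text_embeddings


def gated_cross_attention(text_states, speech, gate):
    """(b) Encoder-decoder: add a gated cross-attention update to each text state."""
    g = math.tanh(gate)  # assumed tanh gating; a gate of 0 leaves the LLM untouched
    return [
        [h + g * c for h, c in zip(state, cross_attend(state, speech))]
        for state in text_states
    ]


# Hypothetical toy inputs: 2 text states and 3 speech frames, all of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
speech = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
adapter = lambda frame: [2.0 * x for x in frame]  # stand-in for a learned adapter

decoder_only_input = prepend_speech(text, speech, adapter)
closed_gate = gated_cross_attention(text, speech, gate=0.0)
open_gate = gated_cross_attention(text, speech, gate=1.0)
print(len(decoder_only_input), closed_gate == text, open_gate != text)
```

In route (a) the sequence the LLM sees grows by the number of speech frames, so every self-attention layer can attend to speech. In route (b) the sequence length is unchanged and a per-layer gate controls how much speech information is mixed in, which is the quantity plotted in Figure 2(a).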