NeKo: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts Language Model

Yen-Ting Lin, Zhehuai Chen, Piotr Zelasko, Zhen Wan, Xuesong Yang, Zih-Ching Chen, Krishna C Puvvada, Szu-Wei Fu, Ke Hu, Jun Wei Chiu, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Chao-Han Huck Yang

TL;DR

NeKo introduces a task-guided Mixture-of-Experts model that unifies post-recognition error correction across ASR, ST/MT, OCR, and TEC within a single framework. By assigning dedicated experts to specific tasks and routing tokens through a shared gating mechanism, NeKo achieves state-of-the-art WER reductions, substantial BLEU gains, and robust zero-shot performance on diverse datasets. The approach demonstrates strong cross-domain generalization and competitive multi-task correction capabilities, with evidence of emergent cross-task benefits. The work highlights the potential of MoE-driven, cross-modal post-recognition correction and provides open-source groundwork for reproducibility and further research.

Abstract

Constructing a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer lies in learning dataset-specific features and consolidating their knowledge in a single model. Previous methods achieve this with separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts (MoE) as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, in which we train the experts to become "experts" in speech-to-text, language-to-text, and vision-to-text datasets by learning to route each dataset's tokens to its mapped expert. Experiments on the Open ASR Leaderboard show that NeKo sets a new state of the art, achieving an average relative WER reduction of 5.0% along with substantial improvements in BLEU scores for speech and translation tasks. In zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-Opus with 15.5% to 27.6% relative WER reduction on the Hyporadise benchmark. NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
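
The abstract reports its gains as relative WER reductions. As a reading aid, the following minimal sketch shows how such a figure can be computed from a reference transcript, a recognizer hypothesis, and a corrected hypothesis. The function names and the example sentences are illustrative assumptions, not taken from the paper or its evaluation code, and text normalization (casing, punctuation) is deliberately left out.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance: substitutions + deletions + insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word errors divided by the reference length."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / n_ref


def relative_wer_reduction(baseline_wer: float, corrected_wer: float) -> float:
    """Fraction of the baseline WER removed by the corrector."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - corrected_wer) / baseline_wer


# Hypothetical example sentences, made up for illustration only.
reference = "the cat sat on the mat"
asr_output = "the cat sad on mat"   # one substitution, one deletion
corrected = "the cat sat on mat"    # one deletion remains

before = wer(reference, asr_output)
after = wer(reference, corrected)
print(f"WER before: {before:.3f}, after: {after:.3f}, "
      f"relative reduction: {relative_wer_reduction(before, after):.1%}")
```

Under this definition, a relative reduction of 5.0% means the corrected WER is 95% of the baseline WER; it is not a drop of five absolute WER points.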

Paper Structure

This paper contains 36 sections, 4 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The proposed NeKo, a new form of multi-task model that improves post-recognition results over speech, text, and visual inputs. NeKo can be used for (i) post automatic speech recognition (ASR) correction, (ii) post speech translation (ST) and machine translation (MT) correction, and (iii) post optical character recognition (OCR) correction. NeKo achieves new state-of-the-art results in (iv) zero-shot ASR correction and performs competitively as a general-purpose (v) multi-task corrector.
  • Figure 2: The architecture of our proposed model, NeKo, which integrates MoE layers within a Transformer architecture. During inference, we do not assume knowledge of the specific task an input belongs to and each token is routed to the top-$2$ experts solely based on their router probabilities.
  • Figure 3: Example prompts of various correction tasks using Automatic Speech Recognition (ASR), Machine Translation (MT), Speech Translation (ST), Optical Character Recognition (OCR), and Textual Error Correction (TEC).
  • Figure 4: Examples of NeKo outputs for the ASR error correction task on SPGISpeech (O'Neill et al., 2021).
  • Figure 5: Examples of NeKo outputs for the speech translation correction task on FLEURS (Conneau et al., 2022).
  • ...and 5 more figures
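
The routing behavior described in the abstract and in the Figure 2 caption can be sketched as follows. The caption confirms only the inference rule: each token goes to the top-2 experts by router probability, with no task label. How the task-mapped expert is enforced during training is not specified in this summary, so the `task_expert` branch below, which forces the mapped expert into the selection and fills the remaining slot from the router ranking, is an assumption made for illustration. The names `route_token` and `TASK_TO_EXPERT`, the expert indices, and the renormalization of the selected weights are likewise hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical task-to-expert mapping; the indices are illustrative only.
TASK_TO_EXPERT = {"asr": 0, "st_mt": 1, "ocr": 2, "tec": 3}


def route_token(router_logits, task_expert=None, k=2):
    """Return {expert_index: mixing_weight} for a single token.

    task_expert=None models inference: plain top-k on router probabilities.
    task_expert=i models an ASSUMED training rule: expert i is always
    selected and the remaining k-1 slots follow the router ranking.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
    if task_expert is None:
        chosen = ranked[:k]
    else:
        others = [e for e in ranked if e != task_expert]
        chosen = [task_expert] + others[:k - 1]
    # Renormalize so the selected experts' weights sum to 1 (an assumption).
    norm = sum(probs[e] for e in chosen)
    return {e: probs[e] / norm for e in chosen}


logits = [2.0, 0.5, 1.0, -1.0]  # made-up router logits for one token
inference_route = route_token(logits)
training_route = route_token(logits, task_expert=TASK_TO_EXPERT["tec"])
print("inference experts:", sorted(inference_route))
print("training experts (assumed rule):", sorted(training_route))
```

In this sketch the task label influences only which experts are selected during training; at inference the same function is called without a label, matching the caption's statement that no task knowledge is assumed.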