Universal-2-TF: Robust All-Neural Text Formatting for ASR

Yash Khare, Taufiquzzaman Peyash, Andrea Vanzo, Takuya Yoshioka

TL;DR

The paper addresses robust post-ASR text formatting by proposing Universal-2-TF, a fully neural, two-stage TF system. The first stage is a shared-encoder, multi-objective token classifier that handles PR and truecasing and identifies ITN spans. It is trained by minimizing the joint objective $\mathcal{L} = \alpha_1 \mathcal{L}_1 + \alpha_2 \mathcal{L}_2 + \alpha_3 \mathcal{L}_3$, so that punctuation, capitalization, and ITN span handling are learned jointly. The second stage is a seq2seq span converter for ITN and mixed-case transformation that rewrites spans into their written forms; it runs only on the identified spans, which controls cost and reduces hallucinations. The authors enrich the training data with careful cleaning and LLM-based augmentation, achieving high TF accuracy and perceptual quality while reducing inference time compared to end-to-end seq2seq or WFST-based baselines. They validate the approach on diverse datasets and report strong objective and subjective results, including improvements in ITN handling. The work advances practical commercial ASR deployment by delivering a robust, efficient, all-neural TF solution with broad domain generalization.
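As a rough illustration of the joint objective, the sketch below computes a weighted sum of three per-task losses. It is not the authors' implementation: the summary does not say which index belongs to which task, so the mapping of $\mathcal{L}_1$, $\mathcal{L}_2$, $\mathcal{L}_3$ to punctuation, truecasing, and ITN span identification is an assumption, and so are the function names, the use of token-level cross-entropy, and the example weights.

```python
import math
from typing import Sequence, Tuple


def mean_cross_entropy(probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Average negative log-likelihood of the target class over all tokens."""
    return sum(-math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def joint_loss(
    punct_loss: float,
    case_loss: float,
    span_loss: float,
    alphas: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum L = a1*L1 + a2*L2 + a3*L3 (task-to-index mapping is assumed)."""
    a1, a2, a3 = alphas
    return a1 * punct_loss + a2 * case_loss + a3 * span_loss


# Hypothetical per-task losses and weights, chosen only for illustration.
total = joint_loss(1.0, 2.0, 3.0, alphas=(0.5, 0.25, 0.25))
print(total)
```

Because all three heads share one encoder, a single backward pass through this scalar would update the encoder with gradients from every task; the weights control how much each task contributes.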

Abstract

This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.

Paper Structure

This paper contains 19 sections, 3 equations, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Universal-2-TF model architecture: A Transformer-based encoder generates token representations of the input text, which are processed by punctuation, truecasing, and ITN span identification heads. Punctuation and truecasing predictions are applied to the input text, from which ITN spans and mixed-case words are extracted along with limited left and right context (one word in this diagram). The identified spans are then processed by a seq2seq model for conversion and reintegrated into the original text.
  • Figure 2: Text formatting examples comparing Universal-2-TF (proposed model) with Universal-1-TF (previous model).
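
The second half of the pipeline described in the Figure 1 caption (extract tagged spans with limited context, convert them, reintegrate them) can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's code: the binary tag format, the function names, and the dictionary-based `toy_convert` that stands in for the seq2seq model are all hypothetical.

```python
from typing import Callable, List, Tuple

# A converter receives (left context, span tokens, right context) and returns text.
Converter = Callable[[List[str], List[str], List[str]], str]


def extract_spans(tags: List[int]) -> List[Tuple[int, int]]:
    """Group consecutive tagged tokens (tag == 1) into [start, end) spans."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag and start is None:
            start = i
        elif not tag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def format_spans(tokens: List[str], tags: List[int], convert: Converter, context: int = 1) -> List[str]:
    """Convert only the tagged spans and splice the results back into the text."""
    out: List[str] = []
    prev = 0
    for start, end in extract_spans(tags):
        left = tokens[max(0, start - context):start]   # limited left context
        right = tokens[end:end + context]              # limited right context
        out.extend(tokens[prev:start])                 # untouched tokens pass through
        out.append(convert(left, tokens[start:end], right))
        prev = end
    out.extend(tokens[prev:])
    return out


def toy_convert(left: List[str], span: List[str], right: List[str]) -> str:
    """Stand-in for the seq2seq model: a lookup table that ignores the context."""
    table = {"twenty five dollars": "$25"}
    spoken = " ".join(span)
    return table.get(spoken, spoken)


tokens = ["I", "paid", "twenty", "five", "dollars", "today."]
tags = [0, 0, 1, 1, 1, 0]
print(" ".join(format_spans(tokens, tags, toy_convert)))
```

Since tokens outside the tagged spans are copied through unchanged, the converter can only alter the spans it is given, which is the property the paper credits with limiting both cost and hallucinations.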