Universal-2-TF: Robust All-Neural Text Formatting for ASR
Yash Khare, Taufiquzzaman Peyash, Andrea Vanzo, Takuya Yoshioka
TL;DR
The paper addresses robust post-ASR text formatting (TF) by proposing Universal-2-TF, a fully neural, two-stage system. The first stage is a shared-encoder, multi-objective token classifier for punctuation restoration (PR) and truecasing. The second stage is a seq2seq span converter for inverse text normalization (ITN) and mixed-case transformation; it is applied only to identified spans, which controls cost and reduces hallucinations. Training minimizes the joint objective $\mathcal{L} = \alpha_1 \mathcal{L}_1 + \alpha_2 \mathcal{L}_2 + \alpha_3 \mathcal{L}_3$, a weighted sum of per-task losses, to jointly learn punctuation, capitalization, and ITN span handling. The authors enrich the training data with careful cleaning and LLM-based augmentation, and report high TF accuracy and perceptual quality with lower inference time than end-to-end seq2seq or WFST-based baselines. They validate the approach on diverse datasets with both objective and subjective evaluations, including improvements in ITN handling. The result is a robust, efficient, all-neural TF solution with broad domain generalization, aimed at practical commercial ASR deployment.
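The weighted objective above can be illustrated with a small numeric sketch. This is not the authors' training code: it assumes each $\mathcal{L}_i$ is a mean token-level cross-entropy, and the helper names (`cross_entropy`, `joint_loss`), the probabilities, and the weights are hypothetical values chosen only to show the arithmetic. The summary does not state which index corresponds to which task.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the gold labels.

    probs  : one probability distribution (list of floats) per token
    labels : gold class index per token
    Assumption: each per-task loss L_i is a token-level cross-entropy.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def joint_loss(task_probs, task_labels, alphas):
    """Weighted sum L = alpha_1*L_1 + alpha_2*L_2 + alpha_3*L_3."""
    return sum(
        a * cross_entropy(p, y)
        for a, p, y in zip(alphas, task_probs, task_labels)
    )

# Hypothetical two-token example with three classification heads.
task_probs = [
    [[0.9, 0.1], [0.2, 0.8]],  # head 1
    [[0.7, 0.3], [0.6, 0.4]],  # head 2
    [[0.5, 0.5], [0.5, 0.5]],  # head 3
]
task_labels = [[0, 1], [0, 0], [1, 0]]
alphas = [1.0, 1.0, 0.5]  # hypothetical weights, not taken from the paper

loss = joint_loss(task_probs, task_labels, alphas)
```

Because the heads share one encoder, a single backward pass through this scalar would update the shared parameters for all three tasks at once; the weights $\alpha_i$ set how much each task contributes.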
Abstract
This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.
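The two-stage flow described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the Universal-2-TF implementation: the per-token predictions (`punct`, `caps`, `span_tags`) are hard-coded stand-ins for the token classifier's output, and `toy_convert_span` is a hypothetical lookup table standing in for the seq2seq model. The point it shows is that the converter is called only on tagged spans, while all other tokens are formatted by cheap per-token decisions.

```python
def format_tokens(tokens, punct, caps, span_tags, convert_span):
    """Combine per-token predictions with span-level conversion.

    tokens       : lowercase ASR output tokens
    punct        : punctuation string to append after each token ("" for none)
    caps         : True where the token should be capitalized
    span_tags    : True where the token belongs to a span needing conversion
    convert_span : callable mapping a list of tokens to a written-form string
                   (stand-in for the seq2seq model)
    """
    out = []
    i = 0
    while i < len(tokens):
        if span_tags[i]:
            # Collect the whole contiguous span and convert it in one call.
            j = i
            while j < len(tokens) and span_tags[j]:
                j += 1
            piece = convert_span(tokens[i:j])
            last = j - 1
        else:
            piece = tokens[i].capitalize() if caps[i] else tokens[i]
            last = i
        out.append(piece + punct[last])
        i = last + 1
    return " ".join(out)

# Hypothetical converter: a lookup table instead of a neural model.
SPAN_TABLE = {("three", "thirty", "pm"): "3:30 PM"}

def toy_convert_span(span_tokens):
    return SPAN_TABLE.get(tuple(span_tokens), " ".join(span_tokens))

tokens = ["the", "meeting", "is", "at", "three", "thirty", "pm", "tomorrow"]
punct = ["", "", "", "", "", "", "", "."]
caps = [True, False, False, False, False, False, False, False]
span_tags = [False, False, False, False, True, True, True, False]

formatted = format_tokens(tokens, punct, caps, span_tags, toy_convert_span)
print(formatted)  # The meeting is at 3:30 PM tomorrow.
```

Restricting generation to short spans keeps the expensive model's input and output small, and any text outside a tagged span cannot be rewritten by it, which is the mechanism the abstract credits for lower cost and fewer hallucinations.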
