Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts

Jiaqing Liu; Chong Deng; Qinglin Zhang; Shilin Zhou; Qian Chen; Hai Yu; Wen Wang

TL;DR

The paper introduces CoS2W, a contextualized Spoken-to-Written conversion task that aims to produce readable, formal text from verbatim ASR transcripts by correcting errors and adapting informal speech to written style while preserving meaning. It presents the SWAB benchmark, a document-level, multilingual dataset (Chinese and English) covering meetings, podcasts, and lectures, augmented with auxiliary information so that models can leverage context. The study evaluates multiple LLMs across granularity levels and context/auxiliary-information strategies, showing that larger models like GPT-4 perform best but still face faithfulness challenges, and that chunk-level and local-context setups generally yield better fidelity and formality than document-level processing. The work demonstrates the potential of LLMs for CoS2W, examines how reliable LLM-based scoring is as an evaluation method, and outlines future directions, including dataset expansion and improved evaluation methods that better capture content preservation and formal writing quality.

Abstract

Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, and hence suffer from poor readability. To improve readability, we propose a Contextualized Spoken-to-Written conversion (CoS2W) task to address ASR and grammar errors and also transfer the informal text into the formal style with content preserved, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on CoS2W performance, and propose methods to exploit contexts and auxiliary information to enhance the outputs. Experimental results reveal that LLMs have the potential to excel in the CoS2W task, particularly in grammaticality and formality, and that our methods enable LLMs to effectively understand contexts and auxiliary information. We further investigate the effectiveness of using LLMs as evaluators and find that LLM evaluators show strong correlations with human evaluations on rankings of faithfulness and formality, which validates the reliability of LLM evaluators for the CoS2W task.
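
As a rough illustration of what chunk-level conversion with local context and auxiliary information could look like, the following Python sketch splits a transcript into chunks and assembles a prompt for one target chunk together with its neighboring chunks. The function names, the prompt wording, the chunk size, and the context window are all assumptions made for this example; they are not the authors' prompts or settings, and no particular LLM API is implied.

```python
from typing import List


def split_into_chunks(sentences: List[str], chunk_size: int) -> List[List[str]]:
    """Group consecutive sentences into chunks; the last chunk may be shorter."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]


def build_prompt(chunks: List[List[str]], index: int, context_window: int = 1,
                 auxiliary_info: str = "") -> str:
    """Build a prompt for one target chunk, with neighboring chunks as local context.

    The prompt wording is hypothetical and only illustrates the structure.
    """
    start = max(0, index - context_window)
    end = min(len(chunks), index + context_window + 1)
    # Flatten the neighboring chunks on each side of the target into plain text.
    before = " ".join(s for chunk in chunks[start:index] for s in chunk)
    after = " ".join(s for chunk in chunks[index + 1:end] for s in chunk)
    target = " ".join(chunks[index])

    parts = [
        "Rewrite the TARGET transcript segment as formal written text. "
        "Fix recognition and grammar errors, and preserve the content."
    ]
    if auxiliary_info:
        parts.append(f"AUXILIARY INFORMATION: {auxiliary_info}")
    if before:
        parts.append(f"PRECEDING CONTEXT: {before}")
    parts.append(f"TARGET: {target}")
    if after:
        parts.append(f"FOLLOWING CONTEXT: {after}")
    return "\n".join(parts)
```

Each prompt would then be sent to an LLM of choice and the converted chunks concatenated in order. Setting `context_window=0` gives a context-free chunk-level baseline, while a single chunk spanning the whole transcript corresponds to document-level processing.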
