Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks

Ori Shapira, Shlomo E. Chazan, Amir DN Cohen

TL;DR

The paper introduces ENDow, a configurable framework to study how transcription noise from ASR affects downstream SLU tasks. It systematically varies noise intensity and type, applies targeted transcript cleaning, and evaluates three tasks (summarization, QA, dialog-act classification) across four LLMs. Key findings show that task performance attenuates differently with noise and that cleaning certain word-types, especially named entities and nouns, can yield meaningful gains with modest effort. The framework enables robust cross-model and cross-task comparisons, guiding practical SLU system design under noisy transcripts and informing where to allocate transcription-improvement resources.

Abstract

With the increasing prevalence of recorded human speech, spoken language understanding (SLU) is essential for its efficient processing. In order to process the speech, it is commonly transcribed using automatic speech recognition technology. This speech-to-text transition introduces errors into the transcripts, which subsequently propagate to downstream NLP tasks, such as dialogue summarization. While it is known that transcript noise affects downstream tasks, a systematic approach to analyzing its effects across different noise severities and types has not been addressed. We propose a configurable framework for assessing task models in diverse noisy settings, and for examining the impact of transcript-cleaning techniques. The framework facilitates the investigation of task model behavior, which can in turn support the development of effective SLU solutions. We exemplify the utility of our framework on three SLU tasks and four task models, offering insights regarding the effect of transcript noise on tasks in general and models in particular. For instance, we find that task models can tolerate a certain level of noise, and are affected differently by the types of errors in the transcript.

Paper Structure

This paper contains 51 sections, 2 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Speech can be transcribed with varying levels of error severity, which affects the results of downstream language understanding tasks. For example, summarizing a transcript with variations of the utterance above might produce differing outcomes. The top version is the reference, and the versions that follow are marked with errors.
  • Figure 2: The pipeline of the ENDow framework, which yields a downstream task score and a WER score for the transcript set that is input to the task. The pipeline is executed for several severities of noising and several types of cleaning techniques. The resulting scores are plotted on a graph for analysis, as in, for example, \ref{fig_cleaning_graphs}.
  • Figure 3: Model performance on the experimented tasks. Curves are compared with area-under-the-curve (AUC) and noise-toleration points (NTP; marked with black dots). NTP marks the WER value where the task-score first decreases significantly from the score at $\text{WER}=0$. A line's shaded area represents its confidence interval. Graphs for the rest of the metrics are in \ref{fig_noclean_graphs_all}.
  • Figure 4: The performance of GPT-4o-mini when applying various cleaning techniques. Compare a point on the "no_cleaning" curve to the respective point on a cleaning technique's curve. Effective cleaning means maximizing gain in task score (y-axis) with minimum effort (x-axis), measured using the cleaning-effectiveness score (CES). Additional CES scores are in \ref{tab_scores_cleaning}, and more graphs are in Figures \ref{fig_cleaning_graphs_all_qmsum}, \ref{fig_cleaning_graphs_all_qaconv} and \ref{fig_cleaning_graphs_all_mrda} in the Appendix.
  • Figure 5: An illustration of the graph generated with the framework, for visual reference.
  • ...and 7 more figures
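
The Figure 3 caption compares score-versus-WER curves by AUC and NTP. The following is a minimal sketch of how those two quantities could be computed from a curve, not the paper's implementation. It assumes the curve is given as points sorted by ascending WER with the first point at WER = 0, that AUC is taken with the trapezoidal rule, and that a "significant" decrease means the confidence interval at a point lies entirely below the confidence interval at WER = 0. The function names, the significance rule, and the sample numbers are all hypothetical.

```python
from typing import Optional, Sequence


def area_under_curve(wer: Sequence[float], score: Sequence[float]) -> float:
    """Trapezoidal area under the task-score vs. WER curve.

    Assumes `wer` is sorted in ascending order.
    """
    area = 0.0
    for i in range(1, len(wer)):
        width = wer[i] - wer[i - 1]
        area += width * (score[i] + score[i - 1]) / 2.0
    return area


def noise_toleration_point(
    wer: Sequence[float],
    score: Sequence[float],
    ci_half_width: Sequence[float],
) -> Optional[float]:
    """First WER whose score is significantly below the score at WER = 0.

    Assumption (not from the paper): a drop is "significant" when the upper
    bound of the confidence interval at that point is below the lower bound
    of the confidence interval at the first point (WER = 0).
    Returns None if no point satisfies this.
    """
    baseline_lower = score[0] - ci_half_width[0]
    for w, s, h in zip(wer[1:], score[1:], ci_half_width[1:]):
        if s + h < baseline_lower:
            return w
    return None


# Made-up illustrative curve; these are not results from the paper.
wer = [0.0, 0.1, 0.2, 0.3, 0.4]
score = [0.50, 0.49, 0.47, 0.40, 0.30]
ci = [0.02, 0.02, 0.02, 0.02, 0.02]

auc = area_under_curve(wer, score)
ntp = noise_toleration_point(wer, score, ci)
print(f"AUC = {auc:.3f}, NTP = {ntp}")
```

Under these assumptions, a larger AUC indicates a model that retains more of its task score across the noise range, and a larger NTP indicates a model that tolerates more transcription noise before its score degrades noticeably.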