Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition

Hsuan Su, Hua Farn, Fan-Yun Sun, Shang-Tse Chen, Hung-yi Lee

TL;DR

This paper finds that task arithmetic is effective at mitigating the synthetic-to-real gap and shows that averaging SYN2REAL task vectors, when real speech from multiple different domains is available, can further adapt the original ASR model to perform better on the target text domain.

Abstract

Synthetic data is widely used in speech recognition due to the availability of text-to-speech models, which facilitate adapting models to previously unseen text domains. However, existing methods lose performance when they fine-tune an automatic speech recognition (ASR) model on synthetic data, because of the distributional shift commonly referred to as the synthetic-to-real gap. In this paper, we find that task vector arithmetic is effective at mitigating this gap. Our proposed method, SYN2REAL task vector, shows an average improvement of 10.03% in word error rate over baselines on the SLURP dataset. Additionally, we show that an average of SYN2REAL task vectors, when we have real speech from multiple different domains, can further adapt the original ASR model to perform better on the target text domain.

Paper Structure (34 sections, 4 equations, 6 figures, 5 tables)

This paper contains 34 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of the SYN2REAL Task Vector Approach. The pre-trained model is fine-tuned on source domain synthetic and real speech data, separately. The difference between their parameters forms the SYN2REAL task vector. The SYN2REAL task vector is then added to a model fine-tuned on target synthetic data to overcome the synthetic-to-real gap.
  • Figure 2: Domain Shifts in ASR Domain Adaptation. Illustration of domain adaptation challenges in ASR, showing shifts between synthetic and real speech across source and target textual domains.
  • Figure 3: Framework for SYN2REAL task vector in Domain Adaptation for ASR. The framework illustrates how the SYN2REAL task vector is created: starting from the pretrained ASR model (PASR), one model is fine-tuned on synthetic speech (Source Synthetic) and another on real speech (Source Real), and the difference between their parameters forms the task vector. This task vector is then applied to the target synthetic domain (Target Synthetic) to improve ASR performance by bridging the gap between synthetic and real speech data.
  • Figure 4: WER vs. Scaling Factor across Different ASR Models & Different TTS Models. The plot shows the average WER on the 'Cooking', 'Music', 'Social', and 'Weather' target domains as a function of the scaling factor $\lambda$ for various ASR models (Whisper and W2V2-conformer) and the TTS models (BARK and Speech T5) used to make SYN2REAL task vectors. Each combination is denoted '{ASR+TTS}', such as 'Whisper Tiny+BARK' in the figure. The scaling factor adjusts the magnitude of the SYN2REAL task vector applied to each model.
  • Figure 5: Cosine Similarity between task vectors derived from Different TTS Models. This heatmap shows the cosine similarity between task vectors generated by BARK (B_) and Speech T5 (S_) models. Higher similarity values between vectors from similar domains indicate effective acoustic-specific information transfer by the SYN2REAL method.
  • ...and 1 more figure
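
The arithmetic described in the captions for Figures 1, 3, 4, and 5 can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. It assumes each model is represented as a dictionary of NumPy arrays, that the task vector is taken as real minus synthetic parameters, and that the scaling factor multiplies the vector before it is added. All function names and the toy values are hypothetical.

```python
import numpy as np

def syn2real_vector(theta_real, theta_syn):
    # Assumed direction: source-real parameters minus source-synthetic parameters.
    return {k: theta_real[k] - theta_syn[k] for k in theta_real}

def apply_vector(theta_target_syn, tau, lam=1.0):
    # Add the scaled task vector to a model fine-tuned on target synthetic data.
    return {k: theta_target_syn[k] + lam * tau[k] for k in theta_target_syn}

def average_vectors(taus):
    # Element-wise mean of task vectors from several source domains.
    return {k: sum(t[k] for t in taus) / len(taus) for k in taus[0]}

def cosine_similarity(tau_a, tau_b):
    # Flatten both task vectors in a fixed key order, then compare directions.
    a = np.concatenate([tau_a[k].ravel() for k in sorted(tau_a)])
    b = np.concatenate([tau_b[k].ravel() for k in sorted(tau_b)])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy parameters standing in for real checkpoints (hypothetical values).
source_real = {"w": np.array([1.0, 2.0])}
source_syn = {"w": np.array([0.5, 1.0])}
target_syn = {"w": np.array([2.0, 2.0])}

tau = syn2real_vector(source_real, source_syn)
adapted = apply_vector(target_syn, tau, lam=0.5)
tau_mean = average_vectors([tau, {"w": np.array([1.5, 1.0])}])
```

With real checkpoints, the dictionaries would come from the models' parameter stores, and the scaling factor would be chosen on held-out data, as the WER-versus-scaling-factor plot suggests.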