Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation

Viet Anh Trinh, Pegah Ghahremani, Brian King, Jasha Droppo, Andreas Stolcke, Roland Maas

TL;DR

The paper tackles geographic disparities in ASR by applying Elastic Weight Consolidation (EWC) to adapt a pretrained RNN-T model to high-WER regions while preserving performance for the overall user population. The adaptation loss combines the standard ASR objective with a regularization term, $\mathcal{L}(\theta)=\mathcal{L}_{ASR}(\theta)+\frac{\lambda}{2}\sum_i F_i(\theta_i-\theta_{p,i}^*)^2$, where $\theta_p^*$ denotes the pretrained parameters and the diagonal Fisher matrix $F_i$ penalizes movement of the parameters that matter most for the pretrained model. Empirical results show the proposed method reduces WER in the highest-WER region by $3.2\%$ relative and the overall WER by $1.3\%$ relative, with a $7.9\%$ reduction in WER variance across regions, outperforming other transfer-learning baselines. The analysis also indicates that adapting the language-model component contributes significantly to fairness, and the approach generalizes to other scenarios requiring dataset-specific adaptation without forgetting prior knowledge.
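The regularized loss can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the flat list-of-floats parameter layout, and the choice to estimate the diagonal Fisher as the mean squared per-example gradient are assumptions made here for illustration only.

```python
from typing import List, Sequence


def diagonal_fisher(per_example_grads: Sequence[Sequence[float]]) -> List[float]:
    """Assumed estimator: mean of squared per-example gradients, one value per parameter."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def ewc_penalty(theta: Sequence[float],
                theta_star: Sequence[float],
                fisher: Sequence[float],
                lam: float) -> float:
    """Compute (lambda / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for f, t, ts in zip(fisher, theta, theta_star)
    )


def ewc_loss(asr_loss: float,
             theta: Sequence[float],
             theta_star: Sequence[float],
             fisher: Sequence[float],
             lam: float) -> float:
    """Total adaptation loss: ASR loss on the target-region data plus the EWC penalty."""
    return asr_loss + ewc_penalty(theta, theta_star, fisher, lam)


# Hypothetical toy values: parameter 0 has a large Fisher value, parameter 1 has none.
fisher = diagonal_fisher([[1.0, 0.0], [3.0, 0.0]])
total = ewc_loss(1.5, theta=[1.0, 10.0], theta_star=[0.0, 0.0], fisher=fisher, lam=2.0)
```

In this toy case the second parameter moves far from its pretrained value without any penalty because its Fisher entry is zero, while the first is penalized in proportion to its Fisher entry. This is the sense in which the regularizer leaves some directions free for adaptation.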

Abstract

We present an approach to reduce the performance disparity between geographic regions without degrading performance on the overall user population for ASR. A popular approach is to fine-tune the model with data from regions where the ASR model has a higher word error rate (WER). However, when the ASR model is adapted to get better performance on these high-WER regions, its parameters wander from the previous optimal values, which can lead to worse performance in other regions. In our proposed method, we utilize the elastic weight consolidation (EWC) regularization loss to identify directions in parameter space along which the ASR weights can vary to improve for high-error regions, while still maintaining performance on the speaker population overall. Our results demonstrate that EWC can reduce the WER in the region with the highest WER by 3.2% relative while reducing the overall WER by 1.3% relative. We also evaluate the role of language and acoustic models in ASR fairness and propose a clustering algorithm to identify WER disparities based on geographic region.

Paper Structure (14 sections, 12 equations, 1 figure, 1 table, 2 algorithms)


Figures (1)

  • Figure 1: 126 regions identified by the clustering tree. Colors do not indicate specific WER values; however, regions with the same color have the same WER.