On the Encoding of Gender in Transformer-based ASR Representations

Aravind Krishnan; Badr M. Abdullah; Dietrich Klakow

TL;DR

This work investigates the encoding and utilization of gender in the latent representations of two transformer-based ASR models, Wav2Vec2 and HuBERT, and suggests the prospect of creating gender-neutral embeddings that can be integrated into ASR frameworks without compromising their efficacy.

Abstract

While existing literature relies on performance differences to uncover gender biases in ASR models, a deeper analysis is essential to understand how gender is encoded and utilized during transcript generation. This work investigates the encoding and utilization of gender in the latent representations of two transformer-based ASR models, Wav2Vec2 and HuBERT. Using linear erasure, we demonstrate the feasibility of removing gender information from each layer of an ASR model and show that such an intervention has minimal impact on ASR performance. Additionally, our analysis reveals a concentration of gender information within the first and last frames in the final layers, explaining the ease of erasing gender in these layers. Our findings suggest the prospect of creating gender-neutral embeddings that can be integrated into ASR frameworks without compromising their efficacy.

Paper Structure (14 sections, 1 equation, 2 figures, 2 tables)

This paper contains 14 sections, 1 equation, 2 figures, 2 tables.

Introduction
Method
Linear Erasure
Concept Scrubbing
Data
Experimental Setup
Gender Scrubbing
Tracking Erasure
Effects on downstream ASR
Analysis
Frame-level probing
Cross-Position Snapshot Training
Conclusions and an Ethical Note
Acknowledgements

Figures (2)

Figure 1: Gender scrubbing for the Wav2Vec2 ASR model. Plots depict linear probe performances at the input and output when linearly erasing gender from the input at each layer. Mean probe performance on the original model is also shown.
Figure 2: Snapshot probing on the layers of pretrained (above) and fine-tuned (below) Wav2Vec2 and HuBERT. The colors indicate the F1-score of a probe trained at the indicated position. Results are shown for LibriSpeech.
