Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology
Robin Netzorg, Alyssa Cote, Sumi Koshin, Klo Vivienne Garoute, Gopala Krishna Anumanchipalli
TL;DR
The paper addresses the limits of static, category-based representations of speaker identity by introducing the Versatile Voice Dataset (VVD), which captures intra-speaker variation in vocal texture across pitch, resonance, and weight from three trans-feminine voice teachers. Using VVD, the authors show that gender classification based on current ECAPA-TDNN speaker embeddings is highly sensitive to voice modification, and that speaker verification yields high equal error rates when voices are drastically modified, revealing a gap between perceptual voice texture and binary gender labels. They also compare expert and non-expert perceptual judgments, showing that both humans and models have difficulty identifying same-speaker pairs as vocal distance increases. The work advocates a texture-centered modeling approach, proposing 3D vocal-feature baselines and perceptual-quality representations (PQ-Representation) to capture intra-speaker variation and guide the development of modification-robust embeddings.
Abstract
As experts in voice modification, trans-feminine gender-affirming voice teachers have unique perspectives on voice that confound current understandings of speaker identity. To demonstrate this, we present the Versatile Voice Dataset (VVD), a collection of three speakers modifying their voices along gendered axes. The VVD illustrates that current approaches to speaker modeling, based on categorical notions of gender and a static understanding of vocal texture, fail to account for the flexibility of the vocal tract. Using publicly available speaker embeddings, we demonstrate that gender classification systems are highly sensitive to voice modification, and that speaker verification systems fail to identify voices as coming from the same speaker as voice modification becomes more drastic. As one path toward moving beyond categorical and static notions of speaker identity, we propose modeling individual qualities of vocal texture such as pitch, resonance, and weight.
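Speaker verification of the kind described above is commonly scored by comparing two embeddings with cosine similarity and summarizing trial outcomes with an equal error rate (EER), the operating point where false accepts and false rejects are equally frequent. The following is a minimal sketch of that evaluation, not the authors' pipeline: the function names `cosine_similarity` and `equal_error_rate` are hypothetical, the threshold sweep is one simple way to approximate the EER, and the scores in the usage lines are toy numbers rather than results from the paper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold.

    scores: similarity per trial, higher means "more likely same speaker".
    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accept = scores >= threshold
        far = np.mean(accept[labels == 0])   # false accept rate
        frr = np.mean(~accept[labels == 1])  # false reject rate
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)


# Toy illustration only: one same-speaker trial scores low (0.3), as might
# happen when a speaker drastically modifies their voice.
toy_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

In this toy setup, a single low-scoring same-speaker trial is enough to push the EER above zero, which mirrors the abstract's point that verification degrades as modification becomes more drastic.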
