Gender-ambiguous voice generation through feminine speaking style transfer in male voices
Maria Koutsogiannaki, Shafel Mc Dowall, Ioannis Agiomyrgiannakis
TL;DR
This work addresses the need for gender-ambiguous synthetic voices by combining a feminine speaking style with a masculine timbre and shifting pitch toward the gender boundary. Using a female Azure TTS voice morphed onto male targets and shifted by $3$ and $4$ semitones toward the boundary at $170$ Hz, the authors generate candidate voices (mJames, mTaylor) and compare them to pitch-only baselines (pJames, pTaylor). A bias-resistant evaluation framework with three listening tests and clearly defined metrics indicates that style transfer yields greater gender ambiguity than pitch modification alone, with only modest degradation in audio quality. The study is the first to place explicit emphasis on speaking style in gender-ambiguous voice generation. It also defines ambiguity criteria and outlines a responsible AI evaluation approach suitable for post-processing TTS systems, laying groundwork for more inclusive voice technologies across diverse gender identities.
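As a rough illustration of the pitch-shift arithmetic mentioned above, the sketch below applies the standard equal-tempered relation $f' = f \cdot 2^{n/12}$ for a shift of $n$ semitones. The $170$ Hz boundary and the $3$ and $4$ semitone shifts come from the summary; the starting pitch of $140$ Hz is a hypothetical value chosen only for illustration, and the function names are our own, not the authors' implementation.

```python
import math


def shift_semitones(f0_hz: float, semitones: float) -> float:
    """Return the frequency obtained by shifting f0_hz by a number of semitones."""
    return f0_hz * 2.0 ** (semitones / 12.0)


def semitone_distance(f_from_hz: float, f_to_hz: float) -> float:
    """Return the signed distance in semitones from f_from_hz to f_to_hz."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)


BOUNDARY_HZ = 170.0        # male-female boundary stated in the summary
EXAMPLE_MALE_F0_HZ = 140.0  # hypothetical mean pitch, not taken from the paper

if __name__ == "__main__":
    for n in (3, 4):
        shifted = shift_semitones(EXAMPLE_MALE_F0_HZ, n)
        gap = semitone_distance(shifted, BOUNDARY_HZ)
        print(f"+{n} semitones: {shifted:.1f} Hz, {gap:+.2f} semitones from boundary")
```

A positive gap means the shifted pitch still lies below the boundary, and a negative gap means it has crossed it.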
Abstract
Recently, under the umbrella of Responsible AI, efforts have been made to develop gender-ambiguous synthetic speech that represents all individuals across the gender spectrum with a single voice. However, these efforts have completely overlooked speaking style, despite differences found among binary and non-binary populations. In this work, we synthesise gender-ambiguous speech by combining the timbre of a male speaker with the manner of speech of a female speaker, using voice morphing and pitch shifting towards the male-female boundary. Subjective evaluations indicate that the morphed samples, which convey the female speaking style, are more ambiguous than samples that undergo plain pitch transformations, suggesting that speaking style can be a contributing factor in creating gender-ambiguous speech. To our knowledge, this is the first study that explicitly uses speaking-style transfer to create gender-ambiguous voices.
