Just Because We Camp, Doesn't Mean We Should: The Ethics of Modelling Queer Voices

Atli Sigurgeirsson, Eddie L. Ungless

TL;DR

The paper investigates the ethics of modeling queer voices in text-to-speech by testing whether a standard multi-speaker TTS pipeline can capture 'gay voice' and how such modeling affects perceived vocal identity. In a case study with 20 self-identified gay speakers and 24 controls drawn from Ted-Lium 3, evaluated by LGBTQ+ participants, the authors compare ground-truth, enhanced, and synthesized speech across segments and utterances, using the voice-cloning system XTTS-v2 for synthesis. They find that synthesis tends to homogenize gay voice: perceived gayness drops for gay speakers but unexpectedly rises for control speakers, and for gay speakers the perceived loss of gay voice correlates with reduced speaker similarity. Citing strong safety and fairness concerns, they argue against publicly releasing queer-voice synthesis and discuss risks such as data privacy, misappropriation, and 'gaydars' that could endanger queer individuals.

Abstract

Modern voice cloning models claim to be able to capture a diverse range of voices. We test the ability of a typical pipeline to capture the style known colloquially as "gay voice" and notice a homogenisation effect: synthesised speech is rated as sounding significantly "less gay" (by LGBTQ+ participants) than its corresponding ground-truth for speakers with "gay voice", but ratings actually increase for control speakers. Loss of "gay voice" has implications for accessibility. We also find that for speakers with "gay voice", loss of "gay voice" corresponds to lower similarity ratings. However, we caution that improving the ability of such models to synthesise "gay voice" comes with a great number of risks. We use this pipeline as a starting point for a discussion on the ethics of modelling queer voices more broadly. Collecting "clean" queer data has safety and fairness ramifications, and the resulting technology may cause harms from mockery to death.

Paper Structure (17 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Mean gay voice rating of both speaker groups in each clip type.
  • Figure 2: Graphs showing the correlation between the similarity accuracy and the reduction in gay voice between reference Segments and Synth (a positive value indicates a loss). Lines represent LOWESS smoothing and shaded areas indicate the 95% confidence intervals.