SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation

Stephen Brade, Sam Anderson, Rithesh Kumar, Zeyu Jin, Anh Truong

TL;DR

This paper addresses the difficulty novice creators face when generating expressive voiceovers using text-to-speech (TTS) systems. It combines a comparative study of existing TTS interfaces with expert interviews with professional voice actors to derive design guidelines for expressive-TTS interfaces, then implements SpeakEasy, a Wizard-of-Oz system that conditions TTS output on high-level context and supports sentence-level refinement and diverse takes. A within-subject evaluation with twelve creators shows that SpeakEasy yields significantly better first-generation suitability, greater variety, superior steerability, and expanded creative horizons with comparable or slightly reduced effort relative to baselines, though naturalness can lag behind the most human-like baselines because of Wizard-of-Oz artifacts. Overall, the work provides actionable design principles and a practical prototype demonstrating how context-informed, high-level feedback workflows can substantially improve expressive content creation, informing future real-time TTS systems and datasets. The findings underscore the potential for context-conditioned expressive TTS to accelerate content production while expanding creative possibilities for creators.

Abstract

Novice content creators often invest significant time recording expressive speech for social media videos. While recent advancements in text-to-speech (TTS) technology can generate highly realistic speech in various languages and accents, many creators struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement with high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators' experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances matching their personal standards, without requiring significantly more effort than leading industry interfaces.

Paper Structure

This paper contains 79 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The ElevenLabs interface contains (A) a drop-down menu for voice selection; (B) a text box editor for inputting the script and generating speech; (C) a control panel with sliders for stability, similarity, and style exaggeration, which control model hyperparameters and model stochasticity; (D) a history tab for revisiting past generations.
  • Figure 2: The Speechify interface contains (A) a drop-down menu for voice selection with regional accent labels and avatars; (B) a text box editor for inputting the script and generating speech; (C) a tone control drop-down menu (for some voices) which contains a list of emotions, and an intensity slider which changes the tone with respect to a given emotion and intensity; (D) granular sliders for speed, pitch, and volume control, as well as additional controls to change pronunciation and manipulate pauses; (E) the text block feature, which allows users to parse the text and apply different controls to different parts of the script.
  • Figure 3: We present a labelled version of the Wizard-of-Oz interface, SpeakEasy, with brief descriptions of each feature's functionality. 1. Script Content View: (A) Script Input: Users upload a script by loading a text file, which is then displayed in this window. (B) Additional Context Input: Users can enter additional context on how they would like the script to be performed. (C) Voice Selection: Users can select one of two voices: Brian or Jessica. After the user clicks generate, they are taken to the Script Editing View. 2. Script Editing View: (D) Edit context button: a back button so users can return to the initial context menu at any time to try a new prompt. (E) Current context display: Users can see the context they've provided or view the suggested context. (F) Undo and Redo: Users can undo any change at any time. (G) Playback bar: Participants can drag this slider to any word in the generated speech. (H) Active Sentence: The sentence currently being edited in the Sentence Iteration Menu. 3. Sentence Iteration Menu: A menu containing features that help users modify a given sentence. (I) Playback speed slider. (J) Adjective Recommendations: adjectives suggested to users based on the current sentence which, when selected, modify the generated speech to reflect that adjective. The first two are fitting, and the third contrasts with the first two. (K) Freeform Text Input: allows users to modify a sentence with any text that comes to mind. (L) Surprise Take: a button which gives the user a completely new performance at random. (M) Comparison Tabs: tabs labelled with the word used to edit the sentence, which allow users to access past iterations of that sentence and compare them to the current iteration. (N) Download button: saves a performance.
  • Figure 4: We use a Wilcoxon signed-rank test to evaluate the significance of results comparing SpeakEasy (blue; middle boxes) with Speechify (salmon; right boxes) and ElevenLabs (yellow; left boxes). Significance levels are indicated by brackets and markers above the box plots ($-: p > .100$, $+: .050 < p < .100$, $*: p < .050$, $**: p < .010$, $***: p < .001$). Full brackets indicate significant results over both baseline interfaces; half brackets indicate significant results over one baseline interface. Dots are the mean rating, red lines are the median, box heights are the interquartile range (IQR), and whiskers extend to the highest and lowest data points within 1.5 times the IQR of Q3 and Q1, respectively. Data points outside the whiskers are shown as circles. Performance is preferable when higher and has an inverted scale relative to the rest of the NASA-TLX metrics, which are preferable when lower.
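
The marker legend and box-plot conventions in the Figure 4 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' analysis code: the function names `significance_marker` and `box_plot_stats` are hypothetical, the handling of p-values that fall exactly on a threshold (for example p = .050) is an assumption because the caption leaves those cases unspecified, and the whisker rule is read as the usual 1.5 × IQR convention. The ratings in the usage lines are made-up values, not study data.

```python
from statistics import mean, quantiles


def significance_marker(p: float) -> str:
    """Map a p-value to the marker legend given in the Figure 4 caption.

    Assumption: a p-value exactly on a threshold falls into the weaker
    (less significant) category; the caption does not cover these cases.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p < 0.001:
        return "***"
    if p < 0.010:
        return "**"
    if p < 0.050:
        return "*"
    if p < 0.100:
        return "+"
    return "-"


def box_plot_stats(ratings: list[float]) -> dict:
    """Compute the quantities the caption says each box plot displays.

    Assumption: whiskers reach the most extreme data points that lie
    within 1.5 * IQR below Q1 and above Q3; anything beyond is an outlier.
    Needs at least two ratings.
    """
    data = sorted(ratings)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "mean": mean(data),          # drawn as a dot
        "median": median,            # drawn as a red line
        "q1": q1,
        "q3": q3,
        "iqr": iqr,                  # box height
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


if __name__ == "__main__":
    # Made-up ratings purely for illustration; not data from the study.
    example = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
    print(significance_marker(0.03))
    print(box_plot_stats(example))
```

Keeping the marker mapping in one function makes the threshold assumptions explicit and easy to change if the authors' exact boundary handling differs.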