SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation
Stephen Brade, Sam Anderson, Rithesh Kumar, Zeyu Jin, Anh Truong
TL;DR
This paper addresses the difficulty novice creators face when generating expressive voiceovers with text-to-speech (TTS) systems. It combines a comparative study of existing TTS interfaces with interviews of professional voice actors to derive design guidelines for expressive-TTS interfaces, then implements SpeakEasy, a Wizard-of-Oz system that conditions TTS output on high-level context and supports sentence-level refinement and diverse takes. A within-subject evaluation with twelve creators shows that SpeakEasy yields significantly more suitable first generations, greater variety, better steerability, and a broader sense of creative possibilities, with comparable or slightly reduced effort relative to baselines. Naturalness, however, can lag behind the most human-like baselines because of Wizard-of-Oz artifacts. Overall, the work provides actionable design principles and a practical prototype showing how context-informed, high-level feedback workflows can improve expressive content creation and accelerate content production, informing future real-time TTS systems and datasets.
Abstract
Novice content creators often invest significant time recording expressive speech for social media videos. While recent advancements in text-to-speech (TTS) technology can generate highly realistic speech in various languages and accents, many creators struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement with high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators' experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances matching their personal standards, without requiring significantly more effort than leading industry interfaces.
