Sketching With Your Voice: "Non-Phonorealistic" Rendering of Sounds via Vocal Imitation

Matthew Caren, Kartik Chandra, Joshua B. Tenenbaum, Jonathan Ragan-Kelley, Karima Ma

TL;DR

This work addresses the challenge of conveying sounds through vocal imitation by treating vocalization as a "non-phonorealistic" depiction, the auditory analogue of a sketch. It introduces a framework that combines a controllable source-filter vocal tract model with perceptual feature matching, then couples this with the Rational Speech Acts (RSA) framework to produce communicative, listener-aware imitations. Once the speaker's costs and utilities are also modeled, the full method tracks human imitation behavior closely (a phonetic-feature correlation of 0.81 with human utterances, compared with 0.56 for the feature-matching baseline) and outperforms baselines in both human-rating and retrieval tasks, while remaining adaptable to constraints such as whispered speech. The approach offers an interpretable route toward non-phonorealistic sound rendering and sketch-based interfaces for sound design, with implications for graphics, cognitive science, and audio search.

Abstract

We present a method for automatically producing human-like vocal imitations of sounds: the equivalent of "sketching," but for auditory rather than visual representation. Starting with a simulated model of the human vocal tract, we first try generating vocal imitations by tuning the model's control parameters to make the synthesized vocalization match the target sound in terms of perceptually-salient auditory features. Then, to better match human intuitions, we apply a cognitive theory of communication to take into account how human speakers reason strategically about their listeners. Finally, we show through several experiments and user studies that when we add this type of communicative reasoning to our method, it aligns with human intuitions better than matching auditory features alone does. This observation has broad implications for the study of depiction in computer graphics.
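
The contrast between feature matching and communicative reasoning can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a discrete set of two utterances and two candidate sounds, uses made-up similarity scores, and applies a textbook RSA recursion (literal listener, then pragmatic speaker) that may differ from the paper's exact formulation. All names (`SIMILARITY`, `feature_matching_choice`, `literal_listener`, `pragmatic_speaker`) are hypothetical.

```python
import math

SOUNDS = ["motorboat", "wind"]
UTTERANCES = ["shh", "woh"]

# Hypothetical acoustic similarity of each utterance to each sound.
# These numbers are invented for illustration only.
SIMILARITY = {
    "shh": {"motorboat": 0.9, "wind": 0.9},
    "woh": {"motorboat": 0.6, "wind": 0.1},
}


def feature_matching_choice(target):
    """Baseline: pick the utterance most similar to the target sound."""
    return max(UTTERANCES, key=lambda u: SIMILARITY[u][target])


def literal_listener(utterance):
    """L0(sound | utterance): similarity normalized over sounds (uniform prior)."""
    total = sum(SIMILARITY[utterance][s] for s in SOUNDS)
    return {s: SIMILARITY[utterance][s] / total for s in SOUNDS}


def pragmatic_speaker(target, alpha=1.0, cost=None):
    """S1(utterance | target) proportional to exp(alpha * (log L0 - cost))."""
    cost = cost or {}
    scores = {
        u: math.exp(
            alpha * (math.log(literal_listener(u)[target]) - cost.get(u, 0.0))
        )
        for u in UTTERANCES
    }
    total = sum(scores.values())
    return {u: score / total for u, score in scores.items()}


if __name__ == "__main__":
    print("feature matching:", feature_matching_choice("motorboat"))
    print("pragmatic speaker:", pragmatic_speaker("motorboat"))
```

With these invented scores, the baseline picks "shh" for the motorboat because it is the closest acoustic match, while the pragmatic speaker favors "woh" because a listener hearing "shh" could just as easily infer wind. Passing a `cost` dictionary (for example, a penalty on utterances that are harder to produce) shifts the speaker's distribution, which is the intuition behind modeling costs and utilities.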

Paper Structure (20 sections, 4 equations, 9 figures)

Figures (9)

  • Figure 1: Stills from a short animation, where all 11 sound effects were vocal imitations produced by our system (see supplemental materials).
  • Figure 2: Schematic diagram of our source-filter model of the vocal tract (see Section \ref{sec:naive}).
  • Figure 3: The problem with the feature matching baseline (Section \ref{sec:naive-issue}). The sound of a motorboat is predominantly the water's loud broadband noise, which is well matched by "shh" (left). However, a speaker trying to imitate a motorboat is likelier to imitate the engine's rumble ("woh"), because it would be more distinctive to a listener: "shh" could be mistaken for wind.
  • Figure 4: Incorporating costs and utilities (Section \ref{sec:rsa-costs}). A speaker imitating a jackhammer might go with a softer, slower sound that is easier to make, and might be okay if the listener infers a different tool (but not, say, a tiger).
  • Figure 5: Vocal imitations generated with communicative reasoning are significantly more human-like than those produced by the feature-matching baseline. Averaged across stereotypically male and female voices, the correlation of phonetic features between communicative-only and human utterances is 0.65, as compared to 0.56 for the baseline model. Modeling the speaker's costs and utilities, as in our full method, further increases the correlation to 0.81.
  • ...and 4 more figures