
Interactive tools for making temporally variable, multiple-attributes, and multiple-instances morphing accessible: Flexible manipulation of divergent speech instances for explorational research and education

Hideki Kawahara, Masanori Morise

TL;DR

This work addresses the complexity barrier in generalized voice morphing by extending WORLD-based tools to support temporally variable, multi-attribute, and multi-instance morphing. It formalizes the generalized morphing framework with transformed-domain interpolation and introduces practical WORLD-based tools, including an integrated morphing-object preparation GUI and an App Designer demonstration for three-instance morphing. Key contributions include the anchor-assignment and time-frequency alignment workflow, procedural step-by-step guidance, and open-source deployment with tutorials to facilitate education and exploratory research. The approach enables accessible, hands-on exploration of speech production and perception, with potential pathways to integrate neural vocoders and end-to-end latent-variable representations while preserving user-driven control over morphing parameters.

Abstract

We generalized a voice morphing algorithm so that it handles temporally variable, multiple-attribute, and multiple-instance morphing. The generalized morphing provides a new strategy for investigating speech diversity. However, its complexity and the difficulty of preparing morphing materials have prevented researchers and students from benefiting from it. To address this issue, we introduced a set of interactive tools that make preparation and testing less cumbersome. These tools are integrated as extensions into our previously reported interactive tools. Using the extended tools in graduate-level lessons was successful. Finally, we outline further extensions for exploring highly complex morphing parameter settings.
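
The summary above mentions transformed-domain interpolation across multiple instances with temporally variable weights. The following is a minimal sketch of that idea for a single attribute (fundamental frequency), not the authors' implementation. It rests on three assumptions: the instances are already time-aligned frame by frame, the transformed domain is the logarithm, and the weights sum to one at every frame. The function name `morph_attribute` and its parameters are hypothetical and do not come from the WORLD tools.

```python
import numpy as np


def morph_attribute(values, weights, to_domain=np.log, from_domain=np.exp):
    """Interpolate one attribute across K time-aligned instances.

    values  : array of shape (K, T), positive attribute values (e.g. F0 in Hz)
    weights : array of shape (K,) for constant weights, or (K, T) for
              temporally variable weights; assumed to sum to one per frame
    Returns an array of shape (T,).

    Hypothetical helper for illustration only; the log/exp pair is an
    assumed choice of transformed domain.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if weights.ndim == 1:
        # Constant weights: one column, broadcast over all frames below.
        weights = weights[:, None]
    weights = np.broadcast_to(weights, values.shape)
    if not np.allclose(weights.sum(axis=0), 1.0):
        raise ValueError("weights must sum to one at every frame")
    # Weighted sum in the transformed domain, then map back.
    return from_domain((weights * to_domain(values)).sum(axis=0))


# Three instances with flat F0 contours over five frames (made-up values).
f0 = np.array([
    np.full(5, 100.0),
    np.full(5, 200.0),
    np.full(5, 400.0),
])

# Constant, equal weights for all three instances.
equal = morph_attribute(f0, [1 / 3, 1 / 3, 1 / 3])

# Temporally variable weights: glide from instance 0 to instance 2.
w0 = np.linspace(1.0, 0.0, 5)
gliding = morph_attribute(f0, np.stack([w0, np.zeros(5), 1.0 - w0]))
```

With a logarithmic domain the result is a weighted geometric mean, so the midpoint between 100 Hz and 400 Hz is 200 Hz rather than 250 Hz. A multi-attribute version would hold one weight array per attribute (for example F0, spectral envelope, and time axis), each applied in its own domain.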

Paper Structure (18 sections, 1 equation, 7 figures)


Figures (7)

  • Figure 1: GUI snapshot of the morphing object preparation. This snapshot shows the final stage of time-frequency alignment (at 6:41 of the video "10" in Fig. 2). The sample (WAVE: 44100 Hz, 32-bit) is a Japanese word /hai/ ("Yes" in English) spoken by the first author.
  • Figure 2: Tutorial video channel of WORLD vocoder tools. This snapshot lists morphing tutorial videos. The QR code links to this YouTube channel.
  • Figure 3: GUI control parts for manipulation.
  • Figure 4: Temporal anchor assignment and adjustment (at 1:08 in video).
  • Figure 5: Frequency anchor assignment on the "non-linear" representation (at 3:09 in video).
  • ...and 2 more figures