Towards Controllable Audio Texture Morphing

Chitralekha Gupta; Purnima Kamath; Yize Wei; Zhuoyao Li; Suranga Nanayakkara; Lonce Wyse

TL;DR

A Generative Adversarial Network (GAN) is conditioned on "soft-labels" distilled from the penultimate layer of an audio classifier trained on a target set of audio texture classes. Interpolating between these conditions, or control vectors, provides smooth morphing between the generated audio textures.

Abstract

In this paper, we propose a data-driven approach to train a Generative Adversarial Network (GAN) conditioned on "soft-labels" distilled from the penultimate layer of an audio classifier trained on a target set of audio texture classes. We demonstrate that interpolation between such conditions, or control vectors, provides smooth morphing between the generated audio textures, and that the approach shows similar or better audio texture morphing capability than state-of-the-art methods. The proposed approach results in a well-organized latent space that generates novel audio outputs while remaining consistent with the semantics of the conditioning parameters. This is a step towards a general data-driven approach to designing generative audio models with customized controls capable of traversing out-of-distribution regions for novel sound synthesis.
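The morphing control described above can be pictured as stepping a class-condition vector from one soft label to another while the noise vector and intra-class parameter stay fixed. The sketch below is illustrative only: it assumes plain linear interpolation, and the soft-label values, the 8-dimensional noise vector, and the helper names (`lerp`, `morph_conditions`, `build_generator_inputs`) are hypothetical rather than taken from the paper. Only the 11-step sweep and the fixed intra-class parameter follow the Figure 3 caption.

```python
import random


def lerp(a, b, t):
    """Linearly interpolate between equal-length vectors a and b at t in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("condition vectors must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def morph_conditions(c_start, c_end, steps=11):
    """Return `steps` class-condition vectors evenly spaced from c_start to c_end."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [lerp(c_start, c_end, i / (steps - 1)) for i in range(steps)]


def build_generator_inputs(z, p, conditions):
    """Concatenate fixed noise z, fixed intra-class parameter p, and each condition."""
    return [list(z) + [p] + list(c) for c in conditions]


# Hypothetical 3-dimensional soft labels (assumed values, not from the paper).
c_wind = [0.9, 0.1, 0.2]
c_water = [0.1, 0.8, 0.3]

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # assumed 8-dim noise vector
p = 0.5                                      # intra-class parameter held fixed

conditions = morph_conditions(c_wind, c_water, steps=11)
inputs = build_generator_inputs(z, p, conditions)
```

Each row of `inputs` would be fed to a trained generator in turn, and the resulting clips concatenated to hear the morph. A one-hot conditioned model can be swept the same way by passing one-hot vectors as `c_start` and `c_end`.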

Paper Structure (14 sections, 1 equation, 5 figures, 3 tables)

Introduction
Conditional GAN
One-Hot GAN
MorphGAN
Experimental Setup
Datasets
Architectures
Models
Results
Audio Quality
Intra-class Morphing
Inter-class Morphing
Semantic exploration of inter-class morphing
Conclusion

Figures (5)

Figure 1: System overview. The GAN input features are a random noise latent vector $Z_p$ ($p$-dim), along with either (a) One-Hot GAN: one-hot vectors for the intra-class parameter $P_q$ ($q$-dim) and the class-identity parameter $C_r$ ($r$-dim), or (b) MorphGAN: a one-dimensional intra-class parameter $P_1$ and $x$-dimensional soft labels for the class parameter $C_x$, taken from the output of the penultimate layer of a pre-trained $n$-class audio classifier.
Figure 2: Three-dimensional soft label values from the penultimate layer of the audio classifier. These are subsequently used for conditioning MorphGAN. The blue markers are water-filling sounds and the red markers are wind sounds.
Figure 3: Concatenated 2s audio outputs from (a) One-Hot GAN, and (b) MorphGAN as the class parameter $C$ is interpolated from wind to water in 11 steps while $P$ is kept fixed.
Figure 4: Output activation values for node0 (water) of the classifier for audio generated from (a) One-Hot GAN, and (b) MorphGAN. The Y-axis is the node0 (water) output from the classifier, and the X-axis is the class parameter interpolated from water to wind.
Figure 5: Spectrogram (top) of concatenated 2s audio outputs and the corresponding audio classifier node activations (bottom) as class parameter (a) dimension 0, (b) dimension 1, and (c) dimension 2 is varied from 0 to 1 in steps of 0.1. The other dimensions are held fixed.
