Example-Based Framework for Perceptually Guided Audio Texture Generation

Purnima Kamath, Chitralekha Gupta, Lonce Wyse, Suranga Nanayakkara

TL;DR

This work tackles the lack of large labeled datasets for audio textures by proposing an Example-Based Framework (EBF) that discovers user-defined semantic guidance vectors in the latent space of an unconditionally trained StyleGAN2 for audio textures. It leverages a GAN Encoder to invert spectrograms into the latent space, and uses synthetic Gaver-based queries to form semantic clusters and prototypes, yielding a direction vector that enables linear, perceptually meaningful edits via $\mathbf{w}_{edited} = \mathbf{w} + \alpha \mathbf{d}$ with $0<\alpha<1$. The method is validated on two texture datasets (impact textures and water fill textures) through objective metrics like Fréchet Audio Distance and a rescoring analysis, as well as perceptual listening tests, and is extended to selective semantic attribute transfer. The approach offers label-free controllability, demonstrates superior attribute control compared to SeFa, and suggests practical applications for texture synthesis in media production and beyond, while outlining limitations and future directions such as manifold constraints, out-of-distribution querying, and integration with text-to-audio models.
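
The linear edit $\mathbf{w}_{edited} = \mathbf{w} + \alpha \mathbf{d}$ can be illustrated with a minimal, dependency-free sketch. Plain Python lists stand in for latent vectors here, and the function name `edit_latent` and the toy 3-dimensional values are assumptions made for illustration; they are not taken from the paper's code. In the actual framework, $\mathbf{w}$ would be an intermediate latent vector of the StyleGAN2 and the result would be passed to the synthesis network.

```python
def edit_latent(w, d, alpha):
    """Return w + alpha * d, computed element-wise.

    w     : latent vector (list of floats), hypothetical toy values below
    d     : direction (guidance) vector of the same length
    alpha : scalar edit strength; the summary above uses 0 < alpha < 1
    """
    if len(w) != len(d):
        raise ValueError("w and d must have the same dimensionality")
    return [wi + alpha * di for wi, di in zip(w, d)]


# Toy 3-D example (illustrative values only).
w = [0.5, -1.0, 2.0]
d = [1.0, 0.0, -2.0]

edited = edit_latent(w, d, 0.5)

# Sweeping alpha gives a sequence of gradually edited latents,
# analogous to the left-to-right progressions shown in the figures.
sweep = [edit_latent(w, d, a) for a in (0.25, 0.5, 0.75)]
```

Because the edit is linear, increasing $\alpha$ moves the latent further along $\mathbf{d}$ while leaving components orthogonal to $\mathbf{d}$ unchanged.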

Abstract

Controllable generation using StyleGANs is usually achieved by training the model using labeled data. For audio textures, however, there is currently a lack of large semantically labeled datasets. Therefore, to control generation, we develop a method for semantic control over an unconditionally trained StyleGAN in the absence of such labeled datasets. In this paper, we propose an example-based framework to determine guidance vectors for audio texture generation based on user-defined semantic attributes. Our approach leverages the semantically disentangled latent space of an unconditionally trained StyleGAN. By using a few synthetic examples to indicate the presence or absence of a semantic attribute, we infer the guidance vectors in the latent space of the StyleGAN to control that attribute during generation. Our results show that our framework can find user-defined and perceptually relevant guidance vectors for controllable generation for audio textures. Furthermore, we demonstrate an application of our framework to other tasks, such as selective semantic attribute transfer.
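
As a rough sketch of how a guidance vector might be inferred from a few examples, the snippet below averages two small groups of latent vectors (attribute absent, attribute present) into prototypes and takes their difference as the direction. This assumes that a prototype is the mean of its cluster and that the direction is the difference of the two prototypes; the function names and toy 2-dimensional values are hypothetical. In the framework itself, the example latents would come from inverting synthetic sounds with the Encoder rather than being written by hand.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def direction_from_examples(absent, present):
    """Return (w_p1, w_p2, d) for two clusters of latent vectors.

    absent  : latents of examples where the attribute is absent or low
    present : latents of examples where the attribute is present or high
    Assumption: prototypes are cluster means, and d = w_p2 - w_p1.
    """
    w_p1 = mean_vector(absent)
    w_p2 = mean_vector(present)
    d = [b - a for a, b in zip(w_p1, w_p2)]
    return w_p1, w_p2, d


# Toy 2-D latents standing in for encoder outputs (illustrative only).
absent = [[0.0, 1.0], [2.0, 1.0]]
present = [[3.0, 5.0], [5.0, 5.0]]

w_p1, w_p2, d = direction_from_examples(absent, present)
```

Using only a handful of examples per cluster is what makes the approach label-free: no annotated training set is needed, only a few queries that indicate each end of the attribute.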

Paper Structure

This paper contains 29 sections, 3 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Schematic outlining the modules within our framework. (a) A StyleGAN's generator. Mapping network $G_m$ maps latent space $\mathcal{Z}$ to intermediate latent space $\mathcal{W}$ ($\mathbb{R}^{\delta_z}\rightarrow\mathbb{R}^{\delta_w}$). Synthesis network $G_s$ maps an intermediate latent vector $\mathbf{w}$ to spectrograms $\mathbf{S}$ ($\mathbb{R}^{\delta_w}\rightarrow\mathbb{R}^{f \times t}$). (b) Schematic of an Encoder $E$ which inverts spectrograms to the intermediate latent space $\mathcal{W}$ ($\mathbb{R}^{f \times t}\rightarrow\mathbb{R}^{\delta_w}$). (c) Schematic of our framework during inference.
  • Figure 2: Schematic for generating semantic attribute clusters, prototypes $\mathbf{w_{p1}}$ and $\mathbf{w_{p2}}$, and the direction vector $\mathbf{d}$.
  • Figure 3: (Top Row) Spectrogram examples of guided generation using our method based on change in the attributes of (a) Rate (increases L to R), (b) Impact Type (becomes scratchy L to R), and (c) Brightness (decreases L to R). Note that for each example, as one attribute changes, the other attributes do not undergo a change. (Bottom Row) Examples of guided generation for water filling a container based on Fill-Level. Note how the Fill-Level and its respective frequency components gradually increase from L to R. All sounds can be auditioned on our webpage https://pkamath2.github.io/audio-guided-generation.
  • Figure 4: Semantic attribute transfer from a reference sample $\mathbf{w_{ref}}$ to a target $\mathbf{w}$, with direction vector $\mathbf{w_{p1}}\rightarrow\mathbf{w_{p2}}$ representing, say, an increasing level of "Brightness". Both $\mathbf{w_{ref}}$ and $\mathbf{w}$ are projected onto the direction vector $\mathbf{d}$. The difference vector $\mathbf{d'}$ between the two projections is used to selectively edit $\mathbf{w}$ to generate $\mathbf{w'}$, so that $\mathbf{w'}$ stands in the same brightness relationship to $\mathbf{w}$ as $\mathbf{w_{ref}}$ does.
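
The projection-based transfer described in the Figure 4 caption can be sketched as follows. This is one plausible reading of the caption, not the authors' implementation: it assumes $\mathbf{d} = \mathbf{w_{p2}} - \mathbf{w_{p1}}$, measures each latent's scalar position along $\mathbf{d}$ relative to $\mathbf{w_{p1}}$, and takes $\mathbf{d'}$ as the difference between the two projections. The function names and toy 2-dimensional values are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def transfer_attribute(w, w_ref, w_p1, w_p2):
    """Move w along d so its projection onto d matches that of w_ref.

    Returns (w_prime, d_prime). Assumes d = w_p2 - w_p1 is non-zero.
    """
    d = [b - a for a, b in zip(w_p1, w_p2)]
    norm_sq = dot(d, d)
    if norm_sq == 0:
        raise ValueError("prototypes coincide; direction is undefined")

    # Scalar coordinate of each latent along d, measured from w_p1.
    t_w = dot([x - p for x, p in zip(w, w_p1)], d) / norm_sq
    t_ref = dot([x - p for x, p in zip(w_ref, w_p1)], d) / norm_sq

    # Difference between the two projections, expressed as a vector along d.
    d_prime = [(t_ref - t_w) * di for di in d]
    w_prime = [wi + dpi for wi, dpi in zip(w, d_prime)]
    return w_prime, d_prime


# Toy 2-D example: the attribute direction lies along the first axis.
w_p1, w_p2 = [0.0, 0.0], [2.0, 0.0]
w = [0.5, 3.0]        # target latent
w_ref = [1.5, -1.0]   # reference latent with a higher attribute level

w_prime, d_prime = transfer_attribute(w, w_ref, w_p1, w_p2)
```

In this toy case only the first coordinate of the target changes; its second coordinate, which is orthogonal to the direction vector, is left as it was. That is the sense in which the transfer is selective.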