LVNS-RAVE: Diversified audio generation with RAVE and Latent Vector Novelty Search

Jinyue Guo, Anna-Maria Christodoulou, Balint Laczko, Kyrre Glette

TL;DR

The paper tackles the tension between realism and creativity in audio generation by combining a deep generative model (RAVE) with Latent Variable Evolution guided by a novelty objective. It evolves latent-space vectors $s_i \in \mathbb{R}^{d\times l}$, which $RDec$ decodes to audio, and uses a VGGish-based perceptual distance $dist(s_i,s_j)=\left\| \mathrm{VGGish}(RDec(s_i)) - \mathrm{VGGish}(RDec(s_j))\right\|$ to compute the novelty $\rho(s_i)=\frac{1}{k}\sum_{j\in U} dist(s_i,s_j)$ with $k=50$, selecting the $N$ samples that maximize total novelty. By initializing, crossbreeding, mutating, and selecting in the RAVE latent space, the method seeks diverse yet realistic outputs, and it is evaluated across three pre-trained RAVE models and multiple setups. The results show diversity increasing over generations, and the mutation parameters give artists a controllable way to balance fidelity and novelty.
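The novelty score $\rho$ can be illustrated with a minimal NumPy sketch. It rests on several assumptions that the summary does not confirm: the rows of `embeddings` stand in for the vectors $\mathrm{VGGish}(RDec(s_i))$ (no RAVE or VGGish model is loaded here), the norm is Euclidean, and the set $U$ is taken to be the $k$ nearest neighbours of $s_i$, as is common in novelty search. The function name `novelty_scores` is hypothetical.

```python
import numpy as np


def novelty_scores(embeddings: np.ndarray, k: int = 50) -> np.ndarray:
    """Mean distance from each embedding to its k nearest neighbours.

    embeddings: array of shape (n, feature_dim); each row is a stand-in
    for VGGish(RDec(s_i)). Assumption: U = k nearest neighbours.
    """
    n = embeddings.shape[0]
    if n < 2:
        raise ValueError("need at least two embeddings")
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances, shape (n, n).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Exclude each sample's zero distance to itself.
    np.fill_diagonal(dist, np.inf)

    # Average over the k smallest distances in every row.
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)


# Toy usage: three clustered points and one outlier.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
scores = novelty_scores(points, k=2)
most_novel = int(np.argmax(scores))  # index of the outlier
```

In the toy example the outlier sits far from its nearest neighbours and therefore receives the highest score, which is the behaviour a novelty objective is meant to reward.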

Abstract

Evolutionary Algorithms and Generative Deep Learning have been two of the most powerful tools for sound generation tasks. However, they have limitations: Evolutionary Algorithms require complicated designs, posing challenges in control and achieving realistic sound generation. Generative Deep Learning models often copy from the dataset and lack creativity. In this paper, we propose LVNS-RAVE, a method to combine Evolutionary Algorithms and Generative Deep Learning to produce realistic and novel sounds. We use the RAVE model as the sound generator and the VGGish model as a novelty evaluator in the Latent Vector Novelty Search (LVNS) algorithm. The reported experiments show that the method can successfully generate diversified, novel audio samples under different mutation setups using different pre-trained RAVE models. The characteristics of the generation process can be easily controlled with the mutation parameters. The proposed algorithm can be a creative tool for sound artists and musicians.
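One generation of such an evolutionary search might look like the following sketch. It is not the authors' implementation: the crossover scheme (copying whole time steps from one parent or the other), the Gaussian mutation with hypothetical parameters `rate` and `scale`, and the greedy top-`n_keep` selection are all assumptions chosen for illustration, and `score_fn` is a placeholder for whatever novelty measure is used.

```python
import numpy as np


def crossbreed(parent_a, parent_b, rng):
    """Assumed crossover: each time step (column) comes from one parent."""
    mask = rng.random(parent_a.shape[1]) < 0.5  # shape (l,)
    return np.where(mask[None, :], parent_a, parent_b)


def mutate(latent, rng, rate=0.1, scale=0.5):
    """Assumed mutation: add Gaussian noise to a random subset of entries."""
    mask = rng.random(latent.shape) < rate
    return latent + mask * rng.normal(0.0, scale, size=latent.shape)


def next_generation(population, score_fn, rng, n_keep, rate=0.1, scale=0.5):
    """Breed children, then keep the n_keep highest-scoring latents.

    population: array of shape (n, d, l) with n >= 2.
    score_fn: maps an array (m, d, l) to m scores (higher = more novel).
    """
    n = population.shape[0]
    children = []
    for _ in range(n):
        a, b = rng.choice(n, size=2, replace=False)  # two distinct parents
        child = crossbreed(population[a], population[b], rng)
        children.append(mutate(child, rng, rate, scale))

    # Parents and children compete together for the next generation.
    pool = np.concatenate([population, np.stack(children)], axis=0)
    keep = np.argsort(score_fn(pool))[::-1][:n_keep]
    return pool[keep]


# Toy usage with a placeholder score (distance from the pool's mean latent).
def toy_score(pool):
    flat = pool.reshape(len(pool), -1)
    return np.linalg.norm(flat - flat.mean(axis=0), axis=1)


rng = np.random.default_rng(0)
population = rng.normal(size=(6, 4, 8))  # n=6 latents, d=4, l=8
population = next_generation(population, toy_score, rng, n_keep=6)
```

In this sketch, raising `rate` or `scale` pushes children further from their parents, which is one plausible way mutation parameters could steer how quickly the population spreads out.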

Paper Structure

This paper contains 8 sections, 3 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Experiment group 1: Sparseness of the containers during the evolution process for three different pre-trained models, all using experiment Setup 1. Lines show the mean sparseness; shaded areas show the standard deviation.
  • Figure 2: Experiment group 2: Sparseness of the containers during the evolution process using the VCTK pre-trained model under four different experiment setups. Lines show the mean sparseness; shaded areas show the standard deviation.
  • Figure 3: Plot of the first two dimensions of the container embedding sequences, generated using Setup 3. Colored lines show the trajectories of five sequences. Gray dots show the distribution of all other embedding vectors in the container.