Hidden Echoes Survive Training in Audio To Audio Generative Instrument Models

Christopher J. Tralie, Matt Amery, Benjamin Douglas, Ian Utz

TL;DR

Echoes hidden in training data make their way into fine-tuned models, survive mixing/demixing, and survive pitch-shift augmentation during training, so this simple, classical watermarking idea shows significant promise for tagging generative audio models.

Abstract

As generative techniques pervade the audio domain, there has been increasing interest in tracing back through these complicated models to understand how they draw on their training data to synthesize new examples, both to ensure that they use properly licensed data and to elucidate their black-box behavior. In this paper, we show that if imperceptible echoes are hidden in the training data, a wide variety of audio-to-audio architectures (differentiable digital signal processing (DDSP), Realtime Audio Variational autoEncoder (RAVE), and "Dance Diffusion") will reproduce these echoes in their outputs. Hiding a single echo is particularly robust across all architectures, but we also show promising results hiding longer time-spread echo patterns for an increased information capacity. We conclude by showing that echoes make their way into fine-tuned models, that they survive mixing/demixing, and that they survive pitch-shift augmentation during training. Hence, this simple, classical idea in watermarking shows significant promise for tagging generative audio models.

Paper Structure

This paper contains 15 sections, 6 equations, 12 figures.

Figures (12)

  • Figure 1: An example of cepstra computed on style transfer of a 30-second excerpt of a Prince jazz session at Loring Park. RAVE models trained on data with different echoes at 50, 75, and 100 lead to visible peaks at the respective places in their cepstra on the synthesized clips.
  • Figure 2: Comparing a 30-second style transfer using a RAVE model with a time-spread echo pattern $p$ embedded in the training data to one without any pattern. The cross-correlation of the cepstrum with $p$ peaks for the model with the embedded pattern.
  • Figure 3: As this example with various tagged VocalSet training data shows, the z-scores for a 75 echo are much higher for the models that are trained on a dataset with a 75 echo embedded in every clip, and the separation increases with increasing clip duration.
  • Figure 4: The means and standard deviations of z-scores for datasets embedded with various single echoes (along each inner row) evaluated for different echoes (along each inner column) show that all architectures (outer rows) only strongly reproduce the echoes that they were trained on across all datasets (outer columns).
  • Figure 5: DDSP models show the strongest preservation of echoes over all model types, as measured by the z-score.
  • ...and 7 more figures
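
The captions above refer to cepstral peaks and z-scores at the echo location. The following is a minimal sketch of that general idea, not the authors' implementation: it adds a single echo to a signal, computes the real cepstrum, and scores the cepstral value at the echo lag against nearby lags. Several things here are assumptions made for illustration: the lag is measured in samples, white noise stands in for audio, the echo gain `alpha=0.3` is far louder than an imperceptible echo would be, and the z-score is taken relative to a hypothetical background window of lags 25 to 124. The function names `embed_echo`, `cepstrum`, and `echo_zscore` are hypothetical.

```python
import numpy as np

def embed_echo(x, lag, alpha=0.3):
    """Add one delayed, scaled copy of x to itself: y[n] = x[n] + alpha * x[n - lag]."""
    y = x.astype(float).copy()
    y[lag:] += alpha * x[:-lag]
    return y

def cepstrum(x, eps=1e-12):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.rfft(x)) + eps)
    return np.fft.irfft(log_mag, n=len(x))

def echo_zscore(x, lag, lo=25, hi=125):
    """Z-score of the cepstrum at `lag` relative to the other lags in [lo, hi).

    Assumption: this background window and normalization are illustrative
    choices, not necessarily the definition used in the paper.
    """
    c = cepstrum(x)
    lags = np.arange(lo, hi)
    background = c[lags[lags != lag]]  # exclude the lag being tested
    return (c[lag] - background.mean()) / background.std()

# Synthetic stand-in for audio (assumption: white noise, fixed seed).
rng = np.random.default_rng(0)
clean = rng.standard_normal(2**16)
tagged = embed_echo(clean, lag=75)

z_tagged = echo_zscore(tagged, 75)  # echo present at the tested lag
z_clean = echo_zscore(clean, 75)    # no echo embedded
z_wrong = echo_zscore(tagged, 50)   # echo present, but at a different lag
```

In this toy setting the score at the embedded lag should stand well clear of both the untagged signal and the wrong lag, which mirrors the separation the captions describe. Whether a comparable peak appears in the output of a trained model is the empirical question the paper addresses, and this sketch does not model that.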