Audio Editing with Non-Rigid Text Prompts

Francesco Paissan; Luca Della Libera; Zhepei Wang; Mirco Ravanelli; Paris Smaragdis; Cem Subakan

TL;DR

The proposed editing pipeline creates audio edits that remain faithful to the input audio and that outperform AudioLDM, a recently released text-prompted audio generation model.

Abstract

In this paper, we explore audio editing with non-rigid text edits. We show that the proposed editing pipeline creates audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We show quantitatively and qualitatively that the resulting edits outperform those of AudioLDM, a recently released text-prompted audio generation model. Qualitative inspection of the results indicates that the edits produced by our approach remain more faithful to the input audio, in that they preserve the original onsets and offsets of the audio events.

Paper Structure (13 sections, 4 equations, 4 figures, 1 table)

Introduction
Methodology
Step 1: Embedding optimization
Step 2: Fine-tuning
Step 3: Interpolation
Experimental setup
Edit types
Quantitative metrics
Text-to-audio model and optimization parameters
Results
Qualitative comparison
Quantitative evaluation
Conclusions

Figures (4)

Figure 1: In step 1, we optimize with respect to $\mathbf e_\text{text}$ to minimize the reconstruction error in Eq. \ref{eq:difeq}, where $\mathbf z=\text{VAEEncoder}(X_\text{inp})$. The resulting optimized text embedding is denoted by $\mathbf e_\text{opt}$. In step 2, we optimize with respect to the diffusion model parameters to minimize the same reconstruction loss as in step 1. Note that steps 1 and 2 use only the part shown in the green box. In step 3, the text embedding is set to a linear combination of the target embedding and the optimized embedding, $\mathbf e_\text{text} = \eta \mathbf e_\text{target} + (1-\eta) \mathbf e_\text{opt}$, and the whole pipeline denoted by the yellow box is used.
Figure 2: Example of a style transfer experiment. The input audio, which corresponds to 'phone ringing', is in the top row, and the text describing the target edit is 'Sound of knocking on the door'. Our edited audio (bottom row) preserves the onset pattern of the original sample better than AudioLDM (middle row). From left to right, the edit strength increases, qualitatively illustrating the trends shown in Fig. \ref{fig:results}.
Figure 3: Comparisons of user preferences on Addition, Inpainting and Style Transfer tasks
Figure 4: Quantitative evaluation of the addition and style transfer edits over 9 samples for each type. The trends are in line with the intuition of $\eta$ as a parameter to control the edit strength.
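
The interpolation in step 3 of Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `interpolate_embeddings`, the use of NumPy, and the toy 4-dimensional embeddings are assumptions made purely for illustration. In the actual pipeline, $\mathbf e_\text{opt}$ would come from step 1 and $\mathbf e_\text{target}$ from the text encoder.

```python
import numpy as np


def interpolate_embeddings(e_target, e_opt, eta):
    """Return eta * e_target + (1 - eta) * e_opt (hypothetical helper)."""
    e_target = np.asarray(e_target, dtype=float)
    e_opt = np.asarray(e_opt, dtype=float)
    if e_target.shape != e_opt.shape:
        raise ValueError("embeddings must have the same shape")
    return eta * e_target + (1.0 - eta) * e_opt


# Toy stand-ins for the real embeddings (assumed values, not from the paper).
e_opt = np.zeros(4)      # placeholder for the optimized embedding from step 1
e_target = np.ones(4)    # placeholder for the embedding of the target prompt

# Sweep the edit strength: eta = 0 recovers e_opt, eta = 1 gives e_target.
sweep = {eta: interpolate_embeddings(e_target, e_opt, eta)
         for eta in (0.0, 0.5, 1.0)}
```

With $\eta = 0$ the conditioning is exactly the optimized embedding, so the fine-tuned model is conditioned on the embedding that reconstructs the input; larger $\eta$ moves the conditioning toward the target prompt, which matches the reading of $\eta$ as an edit-strength parameter in Figure 4.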
