Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

Annie Chu, Patrick O'Reilly, Julia Barnett, Bryan Pardo

TL;DR

This paper addresses open-vocabulary control of audio effects by mapping natural language prompts to differentiable FX parameters without retraining. It introduces Text2FX, which leverages CLAP embeddings and differentiable DSP to perform single-instance optimization that aligns audio embeddings with text prompts. Two optimization strategies—cosine-based and directional—are proposed and evaluated through a comprehensive listener study across multiple prompts and FX chains (EQ and Reverb). Results indicate CLAP contains actionable information for FX control, with the directional approach offering better generalization, while both variants provide complementary strengths. The work advances semantic audio production by enabling intuitive, open-vocabulary manipulation of audio effects and suggests avenues for interactive, human-in-the-loop interfaces in creative contexts.

Abstract

This work introduces Text2FX, a method that leverages CLAP embeddings and differentiable digital signal processing to control audio effects, such as equalization and reverberation, using open-vocabulary natural language prompts (e.g., "make this sound in-your-face and bold"). Text2FX operates without retraining any models, relying instead on single-instance optimization within the existing embedding space, thus enabling a flexible, scalable approach to open-vocabulary sound transformations through interpretable and disentangled FX manipulation. We show that CLAP encodes valuable information for controlling audio effects and propose two optimization approaches using CLAP to map text to audio effect parameters. While we demonstrate with CLAP, this approach is applicable to any shared text-audio embedding space. Similarly, while we demonstrate with equalization and reverberation, any differentiable audio effect may be controlled. We conduct a listener study with diverse text prompts and source audio to evaluate the quality and alignment of these methods with human perception. Demos and code are available at anniejchu.github.io/text2fx.
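
To make the single-instance optimization described in the abstract concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a hypothetical two-band "EQ" (`apply_toy_eq`), a hypothetical stand-in for the CLAP audio encoder (`toy_embed`), and a made-up target vector in place of the text embedding of a prompt such as 'bright'. Finite differences stand in for the gradients that a differentiable DSP chain would supply through automatic differentiation. Only the effect parameters are updated; the embedding function stays fixed, mirroring the "no retraining" setup.

```python
import math

def apply_toy_eq(band_levels, gains_db):
    """Hypothetical 2-band 'EQ': scale each band's amplitude by a gain in dB."""
    return [x * 10.0 ** (g / 20.0) for x, g in zip(band_levels, gains_db)]

def toy_embed(band_levels):
    """Hypothetical stand-in for an audio encoder: unit-normalized band levels."""
    norm = math.sqrt(sum(x * x for x in band_levels))
    return [x / norm for x in band_levels]

def embedding_loss(gains_db, band_levels, target_emb):
    """1 - cosine similarity between the effected audio and the target
    (target_emb is assumed to be unit-norm, so the dot product is the cosine)."""
    emb = toy_embed(apply_toy_eq(band_levels, gains_db))
    return 1.0 - sum(a * b for a, b in zip(emb, target_emb))

def optimize_gains(band_levels, target_emb, steps=200, lr=50.0, h=1e-4):
    """Gradient descent on the effect parameters only; the encoder is frozen.
    Central finite differences replace autograd in this toy setting."""
    gains = [0.0] * len(band_levels)
    for _ in range(steps):
        grad = []
        for i in range(len(gains)):
            up = list(gains)
            up[i] += h
            down = list(gains)
            down[i] -= h
            diff = (embedding_loss(up, band_levels, target_emb)
                    - embedding_loss(down, band_levels, target_emb))
            grad.append(diff / (2.0 * h))
        gains = [g - lr * d for g, d in zip(gains, grad)]
    return gains

# Toy input: equal energy in the [low, high] bands.
source = [1.0, 1.0]
# Made-up "text embedding" that favors the high band (an assumption, not CLAP output).
target = toy_embed([0.3, 1.0])
gains = optimize_gains(source, target)
```

In this toy setting the optimizer raises the high-band gain and lowers the low-band gain, because that is the only way to rotate the toy embedding toward the target. A real system would swap in an actual text-audio encoder and differentiable effects, and would let an autograd framework compute the gradients.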

Paper Structure (12 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Text2FX Example. A previous study [seetharaman2016audealize] found that listeners associate 'bright' with boosting high frequencies ($>$ 2 kHz) and cutting low ones ($<$ 2 kHz). Optimizing the audio in a shared text-audio embedding space (CLAP) towards the embedding for the text 'bright' achieves this. Left: Optimization loss curve. Right: Estimated settings for a 6-band parametric EQ.
  • Figure 2: Left, Text2FX-cosine: Input audio (A) and target prompt (T) are mapped into the same (CLAP) embedding space. A is optimized to move its embedding closer to T, resulting in modified audio (A'). Right, Text2FX-directional: Both the directional vector between a contrasting prompt (T1) and target prompt (T2) and the vector between input audio (A1) and 'effected' audio (A2) are measured. A2 is optimized to make the vector between audio embeddings align with the vector between text embeddings, resulting in A2'.
  • Figure 3: Left: The mean listener evaluation score. Right: The amount by which the mean evaluation score beats the mean listener evaluation score achieved by a random effect. Higher numbers are better. In all conditions, Text2FX-best has a positive mean listener score and always beats Random.
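
The two objectives described in the Figure 2 caption can be sketched as plain functions over embedding vectors. This is an illustrative reading of the caption, not the paper's code: the exact form of each loss (here, one minus cosine similarity), the function names, and the small `eps` guard are all assumptions.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity; eps avoids division by zero for zero-length vectors."""
    return _dot(u, v) / (math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v)) + eps)

def cosine_loss(audio_emb, text_emb):
    """Text2FX-cosine, as read from the caption: move the embedding of the
    effected audio (A') closer to the target prompt embedding (T)."""
    return 1.0 - cosine_similarity(audio_emb, text_emb)

def directional_loss(a1_emb, a2_emb, t1_emb, t2_emb):
    """Text2FX-directional, as read from the caption: align the audio change
    (A1 -> A2) with the text change (contrasting prompt T1 -> target prompt T2)."""
    audio_dir = [b - a for a, b in zip(a1_emb, a2_emb)]
    text_dir = [b - a for a, b in zip(t1_emb, t2_emb)]
    return 1.0 - cosine_similarity(audio_dir, text_dir)
```

The design difference is visible in the arguments: the cosine variant only looks at where the effected audio ends up, while the directional variant looks at how it moved relative to the input, which makes it insensitive to where the input audio started in the embedding space.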