Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

Zixuan Wang, Chi-Keung Tang, Yu-Wing Tai

TL;DR

For video-to-audio generation, this work replaces the separately trained timestamp detector used by most existing methods with a fine-tuned pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, which provides both the semantic and temporal conditions that bridge the video and audio modalities.

Abstract

We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often make single-pass inferences from text descriptions. While straightforward, this design struggles to produce high-quality audio when given complex text conditions. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. In doing so, Audio-Agent can generate high-quality audio that is closely aligned with the provided text or video exhibiting complex and multiple events, while supporting variable-length and variable-volume generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio, a process that can be tedious and time-consuming. Instead, we propose a simpler approach by fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions that bridge the video and audio modalities. Consequently, our framework contributes a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training.
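
The decompose-then-generate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `AtomicInstruction` fields, the `compose` function, the 16 kHz sample rate, and the stub agent are all assumptions made for illustration, and the plan that GPT-4 would produce is written by hand here. The sketch only shows how atomic instructions carrying a start time, duration, and volume could be rendered one at a time and mixed into a single variable-length clip.

```python
from dataclasses import dataclass
from typing import Callable, List

SAMPLE_RATE = 16_000  # assumed sample rate, not taken from the paper


@dataclass
class AtomicInstruction:
    """One atomic generation step (hypothetical schema for illustration)."""
    prompt: str        # single-event text description
    start_s: float     # where the event begins in the final clip, in seconds
    duration_s: float  # how long the event lasts, in seconds
    volume: float = 1.0  # linear gain applied to the generated clip


def stub_tta_agent(prompt: str, num_samples: int) -> List[float]:
    """Stand-in for the TTA diffusion model: returns a constant signal."""
    return [1.0] * num_samples


def compose(
    instructions: List[AtomicInstruction],
    total_s: float,
    agent: Callable[[str, int], List[float]] = stub_tta_agent,
    sample_rate: int = SAMPLE_RATE,
) -> List[float]:
    """Call the agent once per instruction and mix the clips on one timeline."""
    total = int(round(total_s * sample_rate))
    mix = [0.0] * total
    for ins in instructions:
        start = int(round(ins.start_s * sample_rate))
        length = int(round(ins.duration_s * sample_rate))
        clip = agent(ins.prompt, length)
        for i, sample in enumerate(clip):
            pos = start + i
            if 0 <= pos < total:  # drop samples that fall outside the clip
                mix[pos] += ins.volume * sample
    return mix


# Hand-written plan standing in for the LLM's decomposition of the prompt
# "A river stream of water flowing followed by typing on a computer keyboard".
plan = [
    AtomicInstruction("a river stream of water flowing", 0.0, 10.0, volume=0.5),
    AtomicInstruction("typing on a computer keyboard", 10.0, 10.0, volume=1.0),
]
audio = compose(plan, total_s=20.0)
```

In this sketch the two events occupy disjoint time ranges, so they stay separate rather than being blended; overlapping ranges would simply be summed. A real system would swap `stub_tta_agent` for the diffusion backbone and would likely need normalization after mixing.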

Paper Structure (20 sections, 4 equations, 6 figures, 15 tables)

Figures (6)

  • Figure 1: Example showing Audio-Agent's ability to generate, compose and edit multiple audio descriptions together. (A): Multi-turn editing; (B): Generation based on long description; (C): Multiple audio descriptions composition.
  • Figure 2: Overview of the TTA part. We use GPT-4 to convert a complex audio generation process into multiple generation steps and combine inference results.
  • Figure 3: Overview of the generation backbone. We build on top of the pre-trained Auffusion model for both TTA and VTA generation.
  • Figure 4: A demo example showing Audio-Agent's conversation ability. First turn: audio generation; second turn: audio insertion; third turn: audio editing; last turn: audio composition with high-level semantic instructions. Audio-Agent can choose to respond based on previous turns or make independent generations. The corresponding audio files can be found in the supplementary materials.
  • Figure 5: Comparison with baselines on TTA. From top: our Audio-Agent, Auffusion, Make-An-Audio 2, and WavJourney. To demonstrate audio generation based on long, complex text conditions, we ask each model to generate a 20-second audio clip. The text conditions are drawn from the Two Captions category of the TTA evaluation table: (A) A river stream of water flowing followed by typing on a computer keyboard; (B) A woman delivering a speech followed by a male speech and statics; (C) A vehicle engine revving then accelerating at a high rate as a metal surface is whipped followed by tires skidding followed by a door shutting and a female speaking; (D) Continuous white noise followed by a vehicle driving as a man and woman are talking and laughing. Observe (by listening to the corresponding audio files in the supplementary material) that our method successfully generates multi-event audio at different times based on the descriptions, while the other baselines mix the generated events together. WavJourney in particular produces more distinct boundaries between events, but it consistently fails to follow the instruction to generate 20 seconds of audio. The generated audio clips are respectively of length 20 seconds, 30 seconds (truncated here), 16 seconds, and 12 seconds.
  • ...and 1 more figure