AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

William Chen, Prem Seetharaman, Rithesh Kumar, Oriol Nieto, Shinji Watanabe, Justin Salamon, Zeyu Jin

TL;DR

This work proposes AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. It also introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, with the simulated dialogues used as training data.

Abstract

Despite recent breakthroughs, audio foundation models struggle to process complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying on distribution-based scoring. We strongly encourage readers to visit our demo to better understand the capabilities of AudioChat: https://wanchichen.github.io/audiochat/.

Paper Structure

This paper contains 32 sections, 18 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of AudioChat's capabilities in multi-source audio storytelling and editing. AudioChat leverages structured CoT reasoning to break down the user prompt into individual sound effects, allowing for fine-grained control and interpretability.
  • Figure 2: Overview of data generation with AudioCopilot. AudioCopilot seeds its dialogue simulation with a text string and performs tool-calling to render and mix each separate sound source.
  • Figure 3: Model architecture overview. Each input is passed through a modality-specific tokenizer, and the resulting tokens are concatenated and fed into the LM. The LM outputs are then routed through a modality-specific prediction head.
  • Figure 4: Visualization of Transfusion Forcing: a different diffusion timestep is sampled for each audio span. Audio tokens can attend to all other tokens within a span or in previous spans. Text tokens can only attend to previous text and audio tokens.
  • Figure 5: editFLAM($\uparrow$) and multiFLAM($\downarrow$) results on the three semantic editing tasks in StoryGen-Eval.
  • ...and 5 more figures
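
The attention pattern described in the Figure 4 caption can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function names, the `span_id` convention (-1 for text tokens), the uniform timestep distribution, and the choice to let every token attend to itself are all assumptions made for illustration. It only shows the two rules stated in the caption: text tokens see earlier tokens only, while audio tokens additionally see every token in their own span, and each audio span gets its own diffusion timestep.

```python
import numpy as np


def build_attention_mask(is_audio, span_id):
    """Return a boolean matrix where mask[i, j] means token i may attend to token j.

    is_audio: 1 for audio tokens, 0 for text tokens (hypothetical encoding).
    span_id:  index of the audio span a token belongs to, -1 for text tokens.
    """
    is_audio = np.asarray(is_audio, dtype=bool)
    span_id = np.asarray(span_id)
    idx = np.arange(len(is_audio))

    # Causal part: every token may attend to itself and to earlier tokens.
    causal = idx[None, :] <= idx[:, None]

    # Bidirectional part: audio tokens may attend to all tokens in the same span.
    both_audio = is_audio[:, None] & is_audio[None, :]
    same_span = both_audio & (span_id[:, None] == span_id[None, :])

    return causal | same_span


def sample_span_timesteps(is_audio, span_id, rng):
    """Draw one timestep per audio span and broadcast it to that span's tokens.

    Timesteps are drawn uniformly from [0, 1) as an assumption; text tokens get NaN.
    """
    is_audio = np.asarray(is_audio, dtype=bool)
    span_id = np.asarray(span_id)
    timesteps = np.full(len(is_audio), np.nan)
    if not is_audio.any():
        return timesteps
    n_spans = int(span_id[is_audio].max()) + 1
    per_span = rng.uniform(size=n_spans)
    timesteps[is_audio] = per_span[span_id[is_audio]]
    return timesteps


# Toy sequence: two text tokens, a 3-token audio span, one text token, a 2-token audio span.
is_audio = [0, 0, 1, 1, 1, 0, 1, 1]
span_id = [-1, -1, 0, 0, 0, -1, 1, 1]

mask = build_attention_mask(is_audio, span_id)
timesteps = sample_span_timesteps(is_audio, span_id, np.random.default_rng(0))
print(mask.astype(int))
print(timesteps)
```

In the toy sequence, the first audio token of span 0 can attend forward to the last token of its own span, but not to the text token or the audio span that follow it.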