SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

Ayush Barik; Sofia Stoica; Nikhil Sarda; Arnav Kethana; Abhinav Khanduja; Muchen Xu; Fan Lai

TL;DR

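SoundWeaver is the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. It combines three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating, a Skip Gater that dynamically determines the percentage of NFEs to skip, and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement.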

Abstract

Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8--3.0$\times$ latency reduction with a cache of only ${\sim}$1K entries while preserving or improving perceptual quality.
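To make the warm-starting idea concrete, the following is a minimal sketch of how a retrieval step and a skip decision could fit together. It is not the authors' implementation: the `CacheEntry` type, the function names, the cosine-similarity gate, the thresholds (`min_sim`, `max_duration_ratio`, `max_skip`), and the linear mapping from similarity to skipped NFEs are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class CacheEntry:
    """Hypothetical cache record; field names are illustrative only."""
    embedding: Sequence[float]  # prompt embedding from any text/audio encoder
    duration_s: float           # length of the cached clip in seconds
    audio_id: str               # handle to the cached audio or latent


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_reference(
    query_emb: Sequence[float],
    query_duration_s: float,
    cache: List[CacheEntry],
    min_sim: float = 0.6,             # assumed semantic gate
    max_duration_ratio: float = 1.5,  # assumed duration gate
) -> Tuple[Optional[CacheEntry], float]:
    """Return the most similar entry that passes both gates, or (None, 0.0)."""
    best, best_sim = None, min_sim
    for entry in cache:
        shorter = min(entry.duration_s, query_duration_s)
        longer = max(entry.duration_s, query_duration_s)
        if shorter <= 0 or longer / shorter > max_duration_ratio:
            continue  # duration mismatch too large to align
        sim = cosine(query_emb, entry.embedding)
        if sim >= best_sim:
            best, best_sim = entry, sim
    return best, (best_sim if best is not None else 0.0)


def skip_fraction(sim: float, min_sim: float = 0.6, max_skip: float = 0.6) -> float:
    """Assumed linear map: no skipping at min_sim, max_skip at sim == 1."""
    if sim < min_sim:
        return 0.0
    return max_skip * ((sim - min_sim) / (1.0 - min_sim))


def steps_to_run(total_nfes: int, sim: float) -> int:
    """NFEs left to execute after skipping the warm-started portion."""
    return total_nfes - round(total_nfes * skip_fraction(sim))
```

In this sketch a cache miss (no entry passing both gates) yields a similarity of 0.0, so `steps_to_run` falls back to the full NFE budget, which corresponds to ordinary cold-start generation.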

Paper Structure (11 sections, 7 equations, 3 figures, 2 tables)

Introduction
Methods
Reference Selector
Skip Gater
Cache Manager
Evaluation
Main Results
Ablation Studies
Conclusion
Acknowledgments
Generative AI Use Disclosure

Figures (3)

Figure 1: Distribution of CLAP scores for nearest-neighbor retrievals across AudioCaps prompts.
Figure 2: SoundWeaver overview and request execution flow.
Figure 3: Serving latency in online deployments. $SW^\dagger$ leverages synthetic-audio cache; $SW^\ddagger$ uses real-audio cache.
