SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

TL;DR

SoundCTM introduces a unified text-to-sound generation framework that blends fast $1$-step generation with high-quality deterministic multi-step refinement by reframing Consistency Trajectory Models (CTMs) for sound. It advances training with a novel teacher-feature distillation loss, CFG trajectory handling, and $\nu$-sampling to blend conditional and unconditional trajectories. The 1B-scale SoundCTM-DiT-1B demonstrates competitive full-band (44.1 kHz) performance in both 1-step and multi-step regimes and enables deterministic sampling to preserve semantic content during refinement. This approach offers a practical, production-ready pathway for trial-and-refinement workflows in sound design and audio content creation, with potential for controllable generation via loss-based guidance. The codebase is released to support reproducibility and further research in scalable distillation for sound synthesis.

Abstract

Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators. However, these models often suffer from slow inference speeds, imposing an undesirable burden that hinders the trial-and-error process. While existing T2S distillation models address this limitation through $1$-step generation, the sample quality of $1$-step generation remains insufficient for production use. Additionally, while multi-step sampling in those distillation models improves sample quality, the semantic content changes because they lack deterministic sampling capabilities. To address these issues, we introduce Sound Consistency Trajectory Models (SoundCTM), which allow flexible transitions between high-quality $1$-step sound generation and superior sound quality through multi-step deterministic sampling. This allows creators to efficiently conduct trial-and-error with $1$-step generation to semantically align samples with their intention, and subsequently refine sample quality while preserving semantic content through deterministic multi-step sampling. To develop SoundCTM, we reframe the CTM training framework, originally proposed in computer vision, and introduce a novel feature distance using the teacher network for a distillation loss. For production-level generation, we scale up our model to 1B trainable parameters, making SoundCTM-DiT-1B the first large-scale distillation model in the sound community to achieve both promising high-quality $1$-step and multi-step full-band (44.1 kHz) generation.

Paper Structure

This paper contains 52 sections, 16 equations, 13 figures, 15 tables, and 3 algorithms.

Figures (13)

  • Figure 1: SoundCTM-DiT-1B is the first model that achieves high-quality $1$-step and higher-quality multi-step full-band T2S generation while preserving semantic content through deterministic sampling, enabling creators to efficiently carry out the trial-and-refinement creation process within a single model.
  • Figure 2: Illustration of SoundCTM's two predictions $\mathbf{z}_{\text{target}}$ and $\mathbf{z}_{\text{est}}$ at time $s$, both starting from an initial value $\mathbf{z}_t$, and of the feature extraction by the teacher's network for the CTM loss (shown within the blue ellipse). All parameters of the teacher's network are frozen. The conditional embedding $\mathbf{c}$ and time $s$ are also input to the feature extractor. Note that the teacher's network need not be a UNet architecture (Ronneberger et al., 2015).
  • Figure 3: Visualization of spectrograms of generated samples using $1$-step, $2$-step, and $4$-step generation with ConsistencyTTA, AudioLCM, and SoundCTM.
  • Figure 4: Visualization of spectrograms of generated samples by SoundCTM-DiT-1B using $1$-step, $4$-step, and $16$-step generation with stochastic ($\gamma=0.5$) and deterministic ($\gamma=0$) sampling.
  • Figure 5: Influence of $\nu$ on SoundCTM-DiT-1B with $8$-step sampling. Darker colors indicate better scores.
  • ...and 8 more figures
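
The captions for Figures 4 and 5 refer to a sampling parameter $\gamma$ (stochastic at $\gamma=0.5$, deterministic at $\gamma=0$) and a blending weight $\nu$. The following is a minimal toy sketch of how such a sampler could behave, not the SoundCTM implementation. It assumes the $\gamma$-sampling rule from the original CTM formulation under a variance-exploding noise model $\mathbf{z}_t = \mathbf{z}_0 + t\,\boldsymbol{\epsilon}$, and it assumes that $\nu$ linearly mixes a conditional and an unconditional jump. The functions `toy_jump`, `blended_jump`, and `gamma_sample`, the means, and the time grid are all hypothetical; `toy_jump` is an analytic stand-in for the student network on 1-D Gaussian data.

```python
import numpy as np

def toy_jump(z, t, s, mean, data_std=1.0):
    """Analytic stand-in for the student network (an assumption, not SoundCTM):
    the exact probability-flow ODE jump from time t to time s for 1-D Gaussian
    data N(mean, data_std^2) under the noise model z_t = z_0 + t * eps."""
    scale = np.sqrt((data_std**2 + s**2) / (data_std**2 + t**2))
    return mean + scale * (z - mean)

def blended_jump(z, t, s, nu, cond_mean=2.0, uncond_mean=0.0):
    """Assumed form of the nu blend: a linear mix of a conditional jump and an
    unconditional jump. nu = 1 uses only the conditional jump."""
    g_cond = toy_jump(z, t, s, cond_mean)
    g_uncond = toy_jump(z, t, s, uncond_mean)
    return nu * g_cond + (1.0 - nu) * g_uncond

def gamma_sample(z_init, times, gamma, nu, rng):
    """Gamma-sampling over a decreasing time grid that ends at 0.
    Each step jumps to sqrt(1 - gamma^2) * t_next, then adds noise with
    standard deviation gamma * t_next so the total noise level is t_next.
    gamma = 0 adds no noise, so the result depends only on z_init."""
    z = z_init
    for t_cur, t_next in zip(times[:-1], times[1:]):
        s = np.sqrt(1.0 - gamma**2) * t_next
        z = blended_jump(z, t_cur, s, nu)
        if gamma > 0.0 and t_next > 0.0:
            z = z + gamma * t_next * rng.standard_normal(z.shape)
    return z

# Hypothetical setup: shared initial noise, a 3-step grid, and a 1-step grid.
times = np.array([80.0, 10.0, 1.0, 0.0])
z_init = 80.0 * np.random.default_rng(0).standard_normal(4)

# Deterministic sampling (gamma = 0): the random generator is never consulted.
det_a = gamma_sample(z_init, times, 0.0, 1.0, np.random.default_rng(1))
det_b = gamma_sample(z_init, times, 0.0, 1.0, np.random.default_rng(2))
one_step = gamma_sample(z_init, np.array([80.0, 0.0]), 0.0, 1.0,
                        np.random.default_rng(3))

# Stochastic sampling (gamma = 0.5): fresh noise is injected between steps.
sto_a = gamma_sample(z_init, times, 0.5, 1.0, np.random.default_rng(1))
sto_b = gamma_sample(z_init, times, 0.5, 1.0, np.random.default_rng(2))
```

In this toy setting, the two deterministic runs coincide and, because the stand-in jump is exact and $\nu=1$, they also match the 1-step result, while the two stochastic runs diverge from each other. This mirrors, in a very simplified way, why deterministic multi-step refinement can keep a sample's content fixed. With a learned network the 1-step and multi-step outputs would only agree approximately.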