SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing

Jianyi Chen, Rongxiu Zhong, Shilei Zhang, Kun Qian, Jinglei Liu, Yike Guo, Wei Xue

Abstract

Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (sped-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
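To make the length reduction concrete, the following sketch counts Mel frames for a five-minute song at each speed-up rate and uses naive frame averaging as a stand-in for time-scale modification. The sampling rate, hop size, and the helpers `num_frames`, `squeeze`, and `expand` are illustrative assumptions, not the paper's settings. SqueezeComposer itself restores audio with learned models rather than the frame repetition shown here.

```python
import numpy as np

SAMPLE_RATE = 24_000  # assumed value, not taken from the paper
HOP_SIZE = 240        # assumed hop size, giving 100 Mel frames per second


def num_frames(duration_s: float, rate: int = 1) -> int:
    """Mel frames needed for `duration_s` seconds of music played `rate`x faster."""
    samples = int(duration_s * SAMPLE_RATE / rate)
    return samples // HOP_SIZE


def squeeze(mel: np.ndarray, rate: int) -> np.ndarray:
    """Naive stand-in for speed-up: average every `rate` consecutive frames."""
    n_mels, t = mel.shape
    t_trim = (t // rate) * rate  # drop trailing frames that do not fill a group
    return mel[:, :t_trim].reshape(n_mels, t_trim // rate, rate).mean(axis=2)


def expand(mel_fast: np.ndarray, rate: int) -> np.ndarray:
    """Naive stand-in for slow-down: repeat each frame `rate` times."""
    return np.repeat(mel_fast, rate, axis=1)


if __name__ == "__main__":
    for rate in (1, 2, 4, 8):
        print(f"{rate}x: {num_frames(300.0, rate)} frames for a 5-minute song")

    mel = np.random.default_rng(0).normal(size=(80, 1000))
    fast = squeeze(mel, 4)
    restored = expand(fast, 4)
    print(fast.shape, restored.shape)
```

Under these assumed settings, the sequence a generator must model shrinks linearly with the speed-up rate, which is the source of the memory and compute savings the abstract describes. The naive `expand` recovers the original length but not the lost detail, which is why a refinement stage in the restored domain is needed.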

Paper Structure (20 sections, 3 equations, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: The SqueezeComposer illustration: audio is compressed via time-scale modification, generated in a compact domain for various tasks (e.g., long-form composition, accompaniment), and then expanded and refined to the original resolution.
  • Figure 2: Training pipeline for temporal speed-up and restoration. The input audio is sped up and processed through a two-stage pipeline: a CNN generates a prior condition, then a DiT refines it to produce high-quality restored audio. The accelerated audio maintains the original sampling rate, ensuring vocoder compatibility. Training uses an MSE loss for CNN prior generation and a diffusion loss for DiT refinement.
  • Figure 3: Overview of the frameworks using SqueezeComposer for composing. (A) Long-form Music Composing: temporal speed-up enables efficient abstract-level generation using DiT with masking strategies (scratch, completion, continuation). (B) Whole-song Singing Accompaniment Generation: semantic-to-prior mapping followed by DiT refinement for harmonious accompaniment.
  • Figure 4: Audio duration distribution across four music datasets. All datasets show concentration in the 3-10 minute range, demonstrating that the majority of music files fall within the long-form category, with peaks around 5-7 minutes depending on the dataset type.
  • Figure 5: Visualization of squeezing and restoration ($\times 4$) results on two example samples. From top to bottom: the temporally squeezed Mel spectrogram; the original ground-truth Mel spectrogram; the Mel spectrogram restored using a pretrained vocoder, with its restoration error; and the Mel spectrogram restored using a vocoder fine-tuned on restored Mel spectrograms, with its restoration error.
  • ...and 1 more figure
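
The three masking strategies named in the Figure 3(A) caption (scratch, completion, continuation) can be pictured as boolean masks over the frames of the accelerated sequence. This is only a sketch under assumptions: the function `make_mask`, its arguments, and the exact mask layouts (everything masked for scratch, a known prefix for continuation, a masked interior gap for completion) are hypothetical and may differ from the paper's implementation.

```python
from typing import Tuple

import numpy as np


def make_mask(
    num_frames: int,
    task: str,
    known: int = 0,
    gap: Tuple[int, int] = (0, 0),
) -> np.ndarray:
    """Return a boolean mask where True marks frames the model must generate.

    All layouts below are illustrative assumptions, not the paper's definitions.
    """
    mask = np.zeros(num_frames, dtype=bool)
    if task == "scratch":
        # Nothing is given: every frame is generated.
        mask[:] = True
    elif task == "continuation":
        # The first `known` frames are given; the rest are generated.
        mask[known:] = True
    elif task == "completion":
        # Frames in [start, end) are missing and must be filled in.
        start, end = gap
        mask[start:end] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


if __name__ == "__main__":
    print(make_mask(10, "scratch").astype(int))
    print(make_mask(10, "continuation", known=4).astype(int))
    print(make_mask(10, "completion", gap=(3, 7)).astype(int))
```

Expressing all three tasks as masks over one sequence is what would let a single model serve them; only the mask passed at inference time changes.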