SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation

Hongrui Wang, Fan Zhang, Zhiyuan Yu, Ziya Zhou, Xi Chen, Can Yang, Yang Wang

TL;DR

Experiments demonstrate that SyncTrack significantly improves multi-track music quality by enhancing rhythmic consistency, which is evaluated with three novel metrics: Inner-track Rhythmic Stability (IRS), Cross-track Beat Synchronization (CBS), and Cross-track Beat Dispersion (CBD).

Abstract

Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between tracks rather than their inherent properties. In this paper, we introduce SyncTrack, a synchronous multi-track waveform music generation model designed to capture the unique characteristics of multi-track music. SyncTrack features a novel architecture that includes track-shared modules to establish a common rhythm across all tracks and track-specific modules to accommodate diverse timbres and pitch ranges. Each track-shared module employs two cross-track attention mechanisms to synchronize rhythmic information, while each track-specific module utilizes learnable instrument priors to better represent timbre and other unique features. Additionally, we enhance the evaluation of multi-track music quality by introducing rhythmic consistency through three novel metrics: Inner-track Rhythmic Stability (IRS), Cross-track Beat Synchronization (CBS), and Cross-track Beat Dispersion (CBD). Experiments demonstrate that SyncTrack significantly improves the multi-track music quality by enhancing rhythmic consistency.
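
To make the idea of a track-shared module concrete, the following is a minimal sketch of attention across tracks, not the authors' implementation. It assumes each track is a sequence of latent frames and that, at every time step, each track attends over the frames of all tracks at that step. The function name `cross_track_attention`, the data layout, and the omission of learned query/key/value projections are simplifying assumptions made here; the abstract does not specify how the two cross-track attention mechanisms are defined.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_track_attention(latents):
    """Hypothetical sketch: mix information across tracks per time step.

    latents[track][time] is a feature vector (list of floats).
    Queries, keys, and values are the raw frames (identity projections),
    which is a simplification; a real module would learn projections.
    """
    n_tracks = len(latents)
    n_steps = len(latents[0])
    dim = len(latents[0][0])
    scale = 1.0 / math.sqrt(dim)

    out = [[None] * n_steps for _ in range(n_tracks)]
    for t in range(n_steps):
        # All tracks' frames at the same time step form the attention sequence.
        frames = [latents[i][t] for i in range(n_tracks)]
        for i, query in enumerate(frames):
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in frames]
            weights = softmax(scores)
            # Weighted sum of the frames from every track.
            out[i][t] = [sum(w * frame[d] for w, frame in zip(weights, frames))
                         for d in range(dim)]
    return out


# Four tracks, two time steps, two features per frame (toy values).
toy = [[[1.0, 0.0], [0.0, 1.0]],
       [[0.5, 0.5], [1.0, 1.0]],
       [[0.0, 1.0], [1.0, 0.0]],
       [[1.0, 1.0], [0.5, 0.5]]]
mixed = cross_track_attention(toy)
```

Because every output frame is a convex combination of the frames that all tracks hold at the same time step, information such as beat positions can propagate between tracks, which is the intuition behind establishing a common rhythm.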

Paper Structure

This paper contains 26 sections, 8 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: (a) Previous methods (Mariani et al., 2023; Karchkhadze et al., 2025) leverage a unified model to learn the joint distribution of multi-track audio stems. (b) In contrast, our proposed SyncTrack incorporates both track-shared modules and track-specific modules to model the information that is common to all tracks and specific to each track.
  • Figure 2: (a) Overall pipeline of SyncTrack. Training pipeline: we train a four-track latent diffusion model. Each track is perturbed according to the $l$-th signal-to-noise ratio, and the model is optimized to predict the added noise $\epsilon \sim \mathcal{N}(0, I)$. More details are in Section 3.1. Inference pipeline: at test time, four-track latents are generated and then decoded into audio. (b) SyncTrack consists of input, mid, and output blocks, which contain track-specific modules and track-shared modules.
  • Figure 3: Illustration of the (a) track-shared module and (b) track-specific module. In (a), we leverage inner-track attention to capture the inner-track rhythmic stability and devise (c) two cross-track attention submodules to capture cross-track rhythmic stability and synchronization. In (b), we construct a learnable instrument prior to capture timbre and other track-specific features.
  • Figure 4: Comparison of subjective ratings and objective metric scores.
  • Figure A1: Comparison of IRS across hyperparameter settings.
  • ...and 7 more figures
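
The training step summarized in the Figure 2 caption (perturb each track at a given signal-to-noise ratio, then predict the added noise) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: it assumes a variance-preserving perturbation in which the signal-to-noise ratio is $\bar{\alpha}/(1-\bar{\alpha})$, and the names `perturb` and `noise_prediction_loss`, the latent size, and the placeholder "model" are hypothetical.

```python
import math
import random


def perturb(latent, snr, rng):
    """Add Gaussian noise to one track latent at a given signal-to-noise ratio.

    Assumes a variance-preserving schedule, so snr = alpha_bar / (1 - alpha_bar).
    Returns the noisy latent and the noise that was added.
    """
    alpha_bar = snr / (1.0 + snr)
    eps = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
             for x, e in zip(latent, eps)]
    return noisy, eps


def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the added noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)


rng = random.Random(0)
# Four toy track latents of length 8 (sizes are illustrative only).
track_latents = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]

snr = 1.0  # one noise level, shared by all four tracks in this sketch
total_loss = 0.0
for latent in track_latents:
    noisy, eps = perturb(latent, snr, rng)
    # Placeholder for the denoiser: a real model would map the noisy
    # four-track latents to a noise estimate. Here it predicts zeros.
    eps_pred = [0.0 for _ in noisy]
    total_loss += noise_prediction_loss(eps, eps_pred)
```

In an actual training loop, the placeholder prediction would come from the network, and the loss would be minimized with respect to its parameters.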