Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation

Jiwoo Ryu, Hao-Wen Dong, Jongmin Jung, Dasaem Jeong

TL;DR

The paper addresses the long-sequence problem in symbolic music and discrete audio token generation by introducing the Nested Music Transformer (NMT), a memory-efficient autoregressive framework that decodes compound tokens with a main decoder and a cross-attention sub-decoder. It pairs a note-based (NB) compound-token encoding with a Compound Shift strategy to better model interdependencies among sub-tokens, and adds an Embedding Enricher that contextualizes sub-token embeddings. Empirically, NMT, particularly with cross-attention and the Embedding Enricher, achieves competitive or better negative log-likelihood on symbolic music across encodings and improves objective metrics for discrete audio tokens; subjective listening tests indicate perceptual quality comparable to strong baselines. The approach reduces memory usage and training time while remaining effective for both symbolic and audio-token generation. It is also applied to EnCodec-encoded MAESTRO audio, which suggests it is practical for scalable music generation systems.

Abstract

Representing symbolic music with compound tokens, where each token consists of several sub-tokens that each represent a distinct musical feature or attribute, offers the advantage of reducing sequence length. While previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results because it may not fully capture the interdependencies between them. We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens, but with low memory usage. The NMT consists of two transformers: a main decoder that models the sequence of compound tokens and a sub-decoder that models the sub-tokens of each compound token. Experimental results show that applying the NMT to compound tokens improves perplexity on various symbolic music datasets and on discrete audio tokens from the MAESTRO dataset.
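
The two-level decoding described in the abstract can be sketched as a nested loop. This is a minimal illustration of the control flow only, not the authors' implementation: `decode_nested`, `toy_main_step`, and `toy_sub_step` are hypothetical names, and the toy functions are deterministic stand-ins for the two transformers so that the example runs without a deep learning library.

```python
from typing import Callable, List, Sequence, Tuple

Compound = Tuple[int, ...]  # one compound token = a tuple of sub-token ids


def decode_nested(
    main_step: Callable[[Sequence[Compound]], int],
    sub_step: Callable[[int, Sequence[int]], int],
    num_sub_tokens: int,
    num_steps: int,
) -> List[Compound]:
    """Generate `num_steps` compound tokens, one sub-token at a time."""
    sequence: List[Compound] = []
    for _ in range(num_steps):
        # Main decoder: a single pass over the compound tokens generated so far.
        context = main_step(sequence)
        # Sub-decoder: sub-tokens are predicted in order, and each prediction
        # sees the sub-tokens already chosen for the current compound token.
        sub_tokens: List[int] = []
        for _ in range(num_sub_tokens):
            sub_tokens.append(sub_step(context, sub_tokens))
        sequence.append(tuple(sub_tokens))
    return sequence


# Hypothetical stand-ins for the two models (assumptions for illustration).
def toy_main_step(history: Sequence[Compound]) -> int:
    return sum(sum(tok) for tok in history) % 7


def toy_sub_step(context: int, previous: Sequence[int]) -> int:
    return (context + 3 * len(previous) + sum(previous)) % 5


tokens = decode_nested(toy_main_step, toy_sub_step, num_sub_tokens=3, num_steps=4)
flattened_length = sum(len(tok) for tok in tokens)
print(len(tokens), flattened_length)
```

In this sketch the outer loop runs once per compound token, so the main sequence has 4 positions where a flattened encoding of the same content would have 12. The inner loop is what distinguishes sequential sub-token decoding from predicting all sub-tokens simultaneously.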

Paper Structure

This paper contains 23 sections, 5 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Diagram of the nested architecture with three different methods for predicting sub-tokens.
  • Figure 2: An example illustrating the proposed note-based (NB) encodings, (c) NB-Metric1st and (d) NB-Pitch1st, alongside REMI and Compound word. All encodings represent the same piece of music using five musical features. REMI and Compound word were not originally designed for multi-instrument pieces, which is why the variants shown in (a) and (b) are renamed with a "+I" suffix. Here, $k$ denotes the number of notes, which is also the sequence length for NB, while $r$ and $c$ represent the sequence-length ratios for REMI and Compound word, with values greater than 1.
  • Figure 3: Illustrations of the proposed Nested Music Transformer (NMT) and other sub-decoder structures.
  • Figure :