Natural Language Processing Methods for Symbolic Music Generation and Information Retrieval: a Survey

Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller, Dorien Herremans

TL;DR

The survey addresses the question of how NLP methods can be repurposed for symbolic music by organizing developments around representations and model architectures. It catalogues tokenization strategies (time-slice vs event-based, elementary vs composite tokens) and vectorization approaches, then details a spectrum of NLP-inspired models from recurrent nets to Transformer-based, encoder/decoder, and multimodal configurations, including end-to-end and pre-trained paradigms. The authors discuss technical limitations, such as data scarcity and heterogeneous musical alphabets, and contrast structural differences between language and music, advocating for lighter, explainable, and benchmark-driven progress. Collectively, the work provides a structured roadmap for adapting NLP tools to symbolic MIR, highlighting practical implications for generation, retrieval, and cross-modal tasks while outlining essential directions for future research and evaluation frameworks.

Abstract

Several adaptations of Transformer models have been developed in various domains since their breakthrough in Natural Language Processing (NLP). This trend has spread into the field of Music Information Retrieval (MIR), including studies processing music data. However, the practice of leveraging NLP tools for symbolic music data is not novel in MIR. Music has frequently been compared to language, as the two share several similarities, including the sequential representation of text and music. These analogies are also reflected in similar tasks in MIR and NLP. This survey reviews NLP methods applied to symbolic music generation and information retrieval studies along two axes. We first propose an overview of representations of symbolic music adapted from natural language sequential representations. Such representations are designed by considering the specificities of symbolic music. These representations are then processed by models. Such models, possibly originally developed for text and adapted for symbolic music, are trained on various tasks. We describe these models, in particular deep learning models, through different prisms, highlighting music-specialized mechanisms. We finally present a discussion surrounding the effective use of NLP tools for symbolic music data. This includes technical issues regarding NLP methods and fundamental differences between text and music, which may open several doors for further research into more effectively adapting NLP tools to symbolic MIR.

Paper Structure

This paper contains 43 sections and 5 figures.

Figures (5)

  • Figure 1: An oversimplified example of segmentation levels in text and symbolic music. Such segmentations can, however, include more or less fine-grained levels and their delimitations can be ambiguous (see the Discussion section).
  • Figure 2: Evolution of the number of articles containing NLP-related words. (Left) Number of ISMIR papers containing NLP-related words in their abstracts from 2000 to 2023. (Right) Number of arXiv preprints returned by the API query "music AND <term>".
  • Figure 3: Overview of the survey, organized around two axes: similarities and specificities of NLP and MIR tasks motivating representations of symbolic music inspired by NLP and models adapted from NLP for symbolic music.
  • Figure 4: Taxonomy of tokenizations for symbolic music. Tokens are either based on regular time-slices or events. Among event-based tokenization strategies, tokens encode various features of these events: composite (or multidimensional) tokens encapsulate all these features in a single token, in contrast with elementary tokens where each musical feature is processed one after the other.
  • Figure 5: Artificial sequentiality possibly introduced by a tokenization strategy. When the attributes of a note are restricted to pitch, duration, velocity, and time-shift, the sequence of blocks (black dashed blocks) follows temporal order, but the order of the musical features inside each block is arbitrary. The ordering of the blocks themselves can even be artificial for simultaneous events (red dashed block).
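
The tokenization taxonomy in the captions of Figures 4 and 5 can be made concrete with a small sketch. The code below is illustrative only and is not taken from the survey: the `Note` type, the tick-based time grid, the token names (`Shift_`, `Pitch_`, `Dur_`, `Vel_`), and the three functions are all hypothetical. It encodes the same three notes as time-slice tokens, as composite event tokens, and as elementary event tokens.

```python
from typing import List, NamedTuple, Tuple


class Note(NamedTuple):
    """Hypothetical note type; times are in ticks of an assumed regular grid."""
    onset: int      # start time in ticks
    pitch: int      # MIDI pitch number
    duration: int   # length in ticks
    velocity: int   # MIDI velocity


def time_slice_tokens(notes: List[Note], n_steps: int) -> List[Tuple[int, ...]]:
    """Time-slice tokenization: one token per regular time step,
    here the sorted tuple of pitches sounding during that step."""
    return [
        tuple(sorted(n.pitch for n in notes
                     if n.onset <= t < n.onset + n.duration))
        for t in range(n_steps)
    ]


def composite_tokens(notes: List[Note]) -> List[Tuple[int, int, int, int]]:
    """Event-based, composite tokenization: one multidimensional token
    (time-shift, pitch, duration, velocity) per note."""
    # Sorting simultaneous notes by pitch is an arbitrary choice: it imposes
    # an order on events that have none (the red block in Figure 5).
    ordered = sorted(notes, key=lambda n: (n.onset, n.pitch))
    tokens = []
    prev_onset = 0
    for n in ordered:
        tokens.append((n.onset - prev_onset, n.pitch, n.duration, n.velocity))
        prev_onset = n.onset
    return tokens


def elementary_tokens(notes: List[Note]) -> List[str]:
    """Event-based, elementary tokenization: each feature becomes its own
    token, emitted one after the other."""
    out: List[str] = []
    for shift, pitch, dur, vel in composite_tokens(notes):
        # The feature order inside a note is also arbitrary (Figure 5).
        out += [f"Shift_{shift}", f"Pitch_{pitch}", f"Dur_{dur}", f"Vel_{vel}"]
    return out


# Two simultaneous notes (C4 and E4) followed by G4.
notes = [Note(0, 60, 2, 80), Note(0, 64, 2, 80), Note(2, 67, 1, 90)]
slices = time_slice_tokens(notes, n_steps=3)
composite = composite_tokens(notes)
elementary = elementary_tokens(notes)
```

In this sketch the composite encoding yields one token per note, while the elementary encoding yields four tokens per note, so the same passage becomes a sequence four times longer. The pitch-based sort and the fixed feature order are the kind of artificial sequentiality that Figure 5 describes.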