I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition

Yannis Vasilakis, Rachel Bittner, Johan Pauwels

TL;DR

This work examines zero-shot instrument recognition in two-tower audio-text systems, focusing on how pre-joint and joint embeddings behave across three models (MusCALL, Music CLAP, LAION-CLAP) when trained and evaluated on TinySOL. It finds strong audio-encoder performance but weaknesses in the text encoder and the joint projection, with pronounced sensitivity to prompt wording and limited use of textual context. A novel ontology-based metric indicates shallow semantic understanding of instruments in the textual space, underscoring the need for music-focused fine-tuning or alternative mapping strategies. The findings point toward music-informed text representations and datasets as a way to improve cross-modal generalization and zero-shot instrument recognition.

Abstract

Music two-tower multimodal systems integrate audio and text modalities into a joint audio-text space, enabling direct comparison between songs and their corresponding labels. These systems enable new approaches for classification and retrieval, leveraging both modalities. Despite the promising results they have shown for zero-shot classification and retrieval tasks, closer inspection of the embeddings is needed. This paper evaluates the inherent zero-shot properties of joint audio-text spaces for the case study of instrument recognition. We present an evaluation and analysis of two-tower systems for zero-shot instrument recognition and a detailed analysis of the properties of the pre-joint and joint embedding spaces. Our findings suggest that audio encoders alone demonstrate good quality, while challenges remain within the text encoder or joint space projection. Specifically, two-tower systems exhibit sensitivity towards specific words, favoring generic prompts over musically informed ones. Despite the large size of textual encoders, they do not yet leverage additional textual context or infer instruments accurately from their descriptions. Lastly, a novel approach for quantifying the semantic meaningfulness of the textual space leveraging an instrument ontology is proposed. This method reveals deficiencies in the systems' understanding of instruments and provides evidence of the need for fine-tuning text encoders on musical data.
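
To make the zero-shot setting concrete, the following minimal sketch scores each audio embedding against each label embedding by cosine similarity and reports top-k accuracy. It is an illustration only and not the authors' code: the function names (`l2_normalize`, `zero_shot_scores`, `top_k_accuracy`) are hypothetical, and the 3-dimensional vectors are hand-picked toy values standing in for joint-space embeddings, not outputs of MusCALL or the CLAP models.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_scores(audio_emb: np.ndarray, label_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between every clip (rows) and every label (columns)."""
    return l2_normalize(audio_emb) @ l2_normalize(label_emb).T


def top_k_accuracy(scores: np.ndarray, targets: np.ndarray, k: int) -> float:
    """Fraction of clips whose true label is among the k highest-scoring labels."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == targets[:, None]).any(axis=1)
    return float(hits.mean())


# Toy stand-ins for joint-space embeddings (assumed values, not model outputs).
label_emb = np.array([
    [1.0, 0.0, 0.0],  # label 0, e.g. "violin"
    [0.0, 1.0, 0.0],  # label 1, e.g. "flute"
    [0.0, 0.0, 1.0],  # label 2, e.g. "trumpet"
])
audio_emb = np.array([
    [0.9, 0.1, 0.0],  # clip 0, true label 0
    [0.2, 0.7, 0.1],  # clip 1, true label 1
    [0.6, 0.1, 0.5],  # clip 2, true label 2 (ranked second)
    [0.1, 0.5, 0.6],  # clip 3, true label 1 (ranked second)
])
targets = np.array([0, 1, 2, 1])

scores = zero_shot_scores(audio_emb, label_emb)
print(top_k_accuracy(scores, targets, k=1))  # 0.5
print(top_k_accuracy(scores, targets, k=2))  # 1.0
```

In a real two-tower system the rows of `label_emb` would come from passing a text prompt for each instrument through the text encoder and joint projection, which is where the prompt sensitivity described above would enter.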

Paper Structure

This paper contains 13 sections, 3 equations, and 6 figures.

Figures (6)

  • Figure 1: Pipeline of a two-tower multimodal system. A separate model is used for each modality, and their individual representations are projected to a joint audio-text space through a Multi-Layer Perceptron (MLP). This enables direct comparison between audio and textual data. We refer to embeddings obtained before joint-space projection as pre-joint space embeddings.
  • Figure 2: Metrics for 6 textual prompts (see Section \ref{subsec:are_two_tower_systems_context_dependent}), 2 audio-based label embeddings (see Section \ref{subsec:closely_inspecting_the_cosine_similarity_distribution}) and the 3 two-tower multimodal systems. The top row contains top-1 through top-3 accuracy and the bottom row ROC-AUC and PR-AUC. The red line represents random choice.
  • Figure 3: Histograms of audio and label embeddings for positive and negative pairs. When using textual prompts (\ref{fig:positive_negative_histogram_text}), the alignment is problematic, as can be seen from the overlap between positive and negative distributions.
  • Figure 4: The histogram of top-2 class similarities for every song in TinySOL. The CLAP models tend to be less confident, yet they score higher on the metrics than MusCALL, which is overconfident and has the worst metrics.
  • Figure 5: Semantic meaningfulness quantification leveraging Henry Doktorski's instrument ontology. We evaluated the systems over valid triplets obtained through TinySOL labels, as well as every available triplet obtained from the ontology's labels. Accuracy ranges from 49% to 59%, which indicates that the models lack an in-depth understanding of musical instruments.
  • ...and 1 more figure
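
The triplet evaluation mentioned in the Figure 5 caption can be sketched as follows. The exact triplet construction is not given in this excerpt, so the sketch assumes each triplet is (anchor, positive, negative), where the ontology places the positive closer to the anchor than the negative, and counts a triplet as correct when the text embeddings agree with that ordering. The function names, the instrument triplets, and the 2-dimensional embeddings are all hypothetical toy values, not data from the paper or from the ontology.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def triplet_accuracy(embeddings: dict, triplets: list) -> float:
    """Share of (anchor, positive, negative) triplets where the anchor's
    embedding is more similar to the positive than to the negative."""
    correct = sum(
        cosine(embeddings[a], embeddings[p]) > cosine(embeddings[a], embeddings[n])
        for a, p, n in triplets
    )
    return correct / len(triplets)


# Toy text embeddings (assumed values, not produced by any real text encoder).
embeddings = {
    "violin": np.array([1.0, 0.1]),
    "cello": np.array([0.9, 0.3]),
    "trumpet": np.array([0.1, 1.0]),
    "trombone": np.array([1.0, 0.0]),  # deliberately misplaced near the strings
}

# Hypothetical triplets: the positive shares an instrument family with the anchor.
triplets = [
    ("violin", "cello", "trumpet"),     # embeddings agree with the ordering
    ("trumpet", "trombone", "violin"),  # embeddings disagree with the ordering
]

print(triplet_accuracy(embeddings, triplets))  # 0.5
```

Under this assumed setup, embeddings that carry no information about instrument relationships would be expected to score around 50%, which gives context for the 49% to 59% range reported in the caption.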