Can Impressions of Music be Extracted from Thumbnail Images?

Takashi Harada, Takehiro Motomitsu, Katsuhiko Hayashi, Yusuke Sakai, Hidetaka Kamigaito

TL;DR

The paper addresses the lack of public music caption data that include non-musical aspects such as listening situations, times, and emotions. It proposes a thumbnail-driven pipeline that leverages a Large Vision-Language Model to generate captions with non-musical content, using prompting strategies that separate image features from non-musical aspects. A large dataset of approximately 360,000 caption pairs is released and used to train a music retrieval model, with human evaluations validating the quality of non-musical content. The results indicate thumbnail-derived captions can capture impressions beyond musical features and improve retrieval performance aligned with user context.

Abstract

In recent years, there has been a notable increase in research on machine learning models for music retrieval and generation that take natural language sentences as input. However, large-scale publicly available datasets that pair music data with corresponding natural language descriptions, known as music captions, are scarce. In particular, non-musical information, such as the situations suited to listening to a track and the emotions it elicits, is crucial for describing music. This type of information is underrepresented in existing music caption datasets because it is difficult to extract directly from music data. To address this issue, we propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images, and we validate the effectiveness of our approach through human evaluations. Additionally, we created a dataset of approximately 360,000 captions containing non-musical aspects. Using this dataset, we trained a music retrieval model and evaluated it on music retrieval tasks, demonstrating its effectiveness.
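
As a rough illustration of the text-to-music retrieval setting mentioned in the abstract, the sketch below ranks tracks by cosine similarity between a query embedding and per-track embeddings. It is not the paper's model: the function names `cosine` and `rank_tracks`, the track identifiers, and the three-dimensional vectors are all hypothetical placeholders standing in for the outputs of trained text and music encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_tracks(query_vec, track_vecs):
    """Return track ids ordered from most to least similar to the query vector."""
    return sorted(
        track_vecs,
        key=lambda track_id: cosine(query_vec, track_vecs[track_id]),
        reverse=True,
    )


# Hypothetical, hand-written embeddings; a real system would obtain these
# from encoders trained on caption/music pairs.
tracks = {
    "rainy_night_jazz": [0.9, 0.1, 0.2],
    "morning_workout": [0.1, 0.9, 0.3],
    "beach_sunset": [0.3, 0.2, 0.9],
}

# Stands in for an encoded query such as "calm music for a rainy evening".
query = [0.8, 0.2, 0.1]

ranking = rank_tracks(query, tracks)
print(ranking)
```

In this framing, a caption that mentions a listening situation or emotion only helps retrieval if the text encoder was trained on captions containing such non-musical content, which is the gap the dataset is meant to fill.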

Paper Structure (17 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Overview of the music captioning method using thumbnail images.
  • Figure 2: Details of the proposed music captioning method and the methods used for comparison.
  • Figure 3: Prompt 1: prompt to generate music captions from thumbnail images.