Do Captioning Metrics Reflect Music Semantic Alignment?

Jinwoo Lee, Kyogu Lee

TL;DR

This paper presents cases where traditional metrics are vulnerable to syntactic changes and shows that they do not correlate well with human judgments, highlighting the need for a critical reevaluation of how music captions are assessed.

Abstract

Music captioning has emerged as a promising task, fueled by the advent of advanced language generation models. However, its evaluation relies heavily on traditional metrics such as BLEU, METEOR, and ROUGE, which were developed for other domains and have been adopted without proper justification for their use in this new field. We present cases where these metrics are vulnerable to syntactic changes, and show that they do not correlate well with human judgments. By addressing these issues, we aim to emphasize the need for a critical reevaluation of how music captions are assessed.
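
To make the claim about syntactic sensitivity concrete, the following is a minimal sketch of a clipped n-gram precision, the overlap statistic that BLEU builds on. It is not the full BLEU score (no brevity penalty, no geometric mean over n-gram orders), and the captions are invented for illustration; they are not taken from MusicCaps or from the paper's experiments. The function name `clipped_precision` is likewise hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped so a repeated n-gram cannot be credited more
    often than it appears in the reference. Tokenization is a plain
    whitespace split on lowercased text (an assumption of this sketch).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Invented captions, for illustration only.
reference = "a slow piano melody with soft strings"
reordered = "soft strings with a slow piano melody"  # same meaning, new word order
altered = "a slow piano melody with soft drums"      # different instrument

for name, caption in [("reordered", reordered), ("altered", altered)]:
    unigram = clipped_precision(caption, reference, 1)
    bigram = clipped_precision(caption, reference, 2)
    print(f"{name}: unigram={unigram:.3f} bigram={bigram:.3f}")
```

In this toy setting the reordered caption keeps every word of the reference yet receives a lower bigram precision than the caption that names a different instrument, which is the kind of mismatch between surface overlap and semantic alignment that the abstract describes.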

Paper Structure

This paper contains 6 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Experimental results on scores across caption type and evaluation metrics from MusicCaps evaluation set. MOS values are rescaled to [0, 1].
  • Figure :