A Benchmark and Robustness Study of In-Context-Learning with Large Language Models in Music Entity Detection

Simon Hachmeier, Robert Jäschke

TL;DR

This work benchmarks large language models (LLMs) with in-context learning (ICL) for music entity detection in user-generated content, introducing the MusicUGC-NER dataset and integrating it with MusicRecoNER for cross-source evaluation. It shows that LLMs with ICL can surpass strong fine-tuned smaller language model (SLM) baselines, while also revealing a strong influence of pre-training entity exposure on performance. The authors further develop robustness analyses through cloze-based data synthesis, a factual memorization test, and perturbations to unseen entities, highlighting that exposure effects can dominate perturbation effects in this task. The study points toward combining LLMs with gazetteers or retrieval-augmented generation to improve generalization to unseen music entities and mitigate hallucination in information extraction (IE) settings.
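To make the idea of cloze-based data synthesis more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a cloze is a whitespace-tokenized template with bracketed slots, and that filling a slot yields IOB tags for the inserted entity. The function name `fill_cloze`, the slot labels `TITLE` and `ARTIST`, and the template text are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def fill_cloze(template: str, entities: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Fill bracketed slots (e.g. [ARTIST]) in a template and emit one IOB tag per token.

    Hypothetical helper: slot syntax and label names are assumptions,
    not taken from the paper.
    """
    tokens: List[str] = []
    tags: List[str] = []
    for piece in template.split():
        slot = piece.strip("[]")
        is_slot = piece.startswith("[") and piece.endswith("]") and slot in entities
        if is_slot:
            # Replace the slot with the entity's words: B- for the first, I- for the rest.
            words = entities[slot].split()
            tokens.extend(words)
            tags.extend([f"B-{slot}"] + [f"I-{slot}"] * (len(words) - 1))
        else:
            # Any token outside a slot gets the outside tag.
            tokens.append(piece)
            tags.append("O")
    return tokens, tags


# Illustrative template and entity values (assumed, not from the dataset).
tokens, tags = fill_cloze(
    "[TITLE] by [ARTIST] live cover",
    {"TITLE": "Yesterday", "ARTIST": "The Beatles"},
)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Swapping the values in the `entities` mapping while keeping the template fixed is one simple way to compare behavior on seen versus unseen entities under an otherwise identical context.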

Abstract

Detecting music entities such as song titles or artist names supports use cases like processing music search queries or analyzing music consumption on the web. Recent approaches use smaller language models (SLMs) like BERT and achieve strong results. However, further research indicates that entity exposure during pre-training strongly influences the performance of these models. Large language models (LLMs) now outperform SLMs in a variety of downstream tasks, but researchers are still divided on whether this holds for tasks like entity detection in text, due to issues like hallucination. In this paper, we provide a novel dataset of user-generated metadata and conduct a benchmark and a robustness study using recent LLMs with in-context learning (ICL). Our results indicate that LLMs in the ICL setting yield higher performance than SLMs. We further uncover the large impact of entity exposure on the best-performing LLM in our study.

Paper Structure

This paper contains 55 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Relative positions of the utterances per class in D-YT up to the 9th utterance. O refers to the outside tag in the IOB format.
  • Figure 2: Questions of our factual memorization test (FMT) on the example of the musical work Yesterday originally performed by The Beatles.
  • Figure 3: Proportions of errors per group based on our synthesized datasets without perturbation. The total amount is 1,067, the number of all unique clozes.
  • Figure 4: Error proportions per metric per imposed perturbation level and synthesized dataset. The total amount is 1,067, the number of all unique clozes.
  • Figure 5: Cumulative distribution functions of F1 scores per data source on our synthesized dataset.
  • ...and 6 more figures