MELD-ST: An Emotion-aware Speech Translation Dataset

Sirou Chen, Sakiko Yahata, Shuichiro Shimizu, Zhengdong Yang, Yihang Li, Chenhui Chu, Sadao Kurohashi

TL;DR

This work introduces MELD-ST, a dataset for emotion-aware speech translation in English→Japanese and English→German, built from the TV show Friends and annotated with MELD emotion labels. It provides audio, subtitles, timestamps, and emotion annotations for roughly 10k utterances per direction, with careful extraction, cleaning, and alignment procedures. Baseline experiments using SeamlessM4T v2 show that fine-tuning with emotion labels can yield modest translation gains in some speech-to-text translation (S2TT) settings, though benefits for speech-to-speech translation (S2ST) are limited and prosody remains largely unaffected. The dataset aims to spur research on emotion-aware speech translation, while acknowledging limitations such as acted speech and alignment challenges, and points to future work in multitask learning and exploiting dialogue context.

Abstract

Emotion plays a crucial role in human conversation. This paper underscores the significance of considering emotion in speech translation. We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs. Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset. Baseline experiments using the SeamlessM4T model on the dataset indicate that fine-tuning with emotion labels can enhance translation performance in some settings, highlighting the need for further research in emotion-aware speech translation systems.

Paper Structure (20 sections, 9 tables)