MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling

Yifan Cheng, Ruoyi Zhang, Jiatong Shi

TL;DR

The paper tackles data scarcity for emotional speech by introducing MIKU-PAL, a fully automated multimodal pipeline that labels emotions in unlabeled video using audio, visual, and text cues. It combines audio source separation (MDX-Net), face detection (S$^3$FD/DSFD), active speaker identification (TalkNet), and MLLM-based emotion analysis (Gemini 2.0) to deliver high-consistency annotations across 26 emotion categories. The approach reaches human-level or near-human performance, with a Fleiss' kappa of around 0.93–0.95 and 68.5% accuracy on MELD, while significantly reducing the cost and time of collecting labeled data. It also introduces MIKU-EmoBench, a 131.2-hour dataset spanning 26 emotions for emotional TTS and visual voice cloning. The work demonstrates that scalable, fine-grained emotion labeling from web data is viable and can support advances in emotional TTS systems. It acknowledges potential biases from the source data and dependence on the underlying model, and outlines avenues for improving robustness and mitigating bias.

Abstract

Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system using a multimodal large language model (MLLM). Our results demonstrate that MIKU-PAL can achieve human-level accuracy (68.5% on MELD) and superior consistency (a Fleiss' kappa of 0.93) while being much cheaper and faster than human annotation. With the high-quality, flexible, and consistent annotation from MIKU-PAL, we can annotate up to 26 fine-grained speech emotion categories, validated by human annotators with 83% rationality ratings. Based on our proposed system, we further release a fine-grained emotional speech dataset, MIKU-EmoBench (131.2 hours), as a new benchmark for emotional text-to-speech and visual voice cloning.
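The consistency figure quoted above is a Fleiss' kappa, which measures agreement among a fixed number of raters beyond what chance would produce. The following is a minimal sketch of the standard formula, not the authors' evaluation code. The function name `fleiss_kappa` and the toy count table are illustrative assumptions, and this summary does not say whether the "raters" behind the reported score are human annotators or repeated runs of the pipeline.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (>= 2).
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])
    total_ratings = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / total_ratings
        for j in range(n_categories)
    ]
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_observed = sum(
        (sum(c * c for c in row) - n_raters) / pairs for row in counts
    ) / n_items

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy table (hypothetical): 4 clips, 3 raters, 3 emotion categories.
    toy = [
        [3, 0, 0],
        [0, 3, 0],
        [2, 1, 0],
        [0, 0, 3],
    ]
    print(f"kappa = {fleiss_kappa(toy):.3f}")
```

A value near 1 means the raters almost always pick the same category, while a value near 0 means agreement is no better than chance.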

Paper Structure

This paper contains 10 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The structure overview of MIKU-PAL: It analyzes visual, text, and audio modalities across three stages.
  • Figure 2: Overview of the MLLM chat. The system prompt has three parts: mission description, textual description of emotions, and output structure. The user prompt contains only the raw video and text. The example output shows a representative system response.
  • Figure 3: Mixed emotion analysis on 10,000 YouTube video segments using MIKU-PAL. Annotation results are reduced to two dimensions using t-SNE. Each data point is labeled according to the emotion category with the highest intensity and colored using a weighted interpolation based on all emotion categories present. The visualization demonstrates MIKU-PAL's ability to model the continuous human emotion space and the gradient relationships between emotion categories.
  • Figure 4: Confusion matrices for MELD (left) and IEMOCAP (right). MIKU-PAL performs well on emotions that have been psychologically validated.
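
The labeling and coloring rule described in the Figure 3 caption can be sketched as follows: each point takes the name of its highest-intensity emotion and a color that is the intensity-weighted average of the colors of all emotions present. This is only an illustration under stated assumptions. The function names, the emotion names, and the `PALETTE` values are hypothetical, the plotting code used for the figure is not given here, and the t-SNE projection step is omitted.

```python
from typing import Dict, Tuple

RGB = Tuple[float, float, float]

# Hypothetical palette: one base color per emotion category.
PALETTE: Dict[str, RGB] = {
    "joy": (255.0, 200.0, 0.0),
    "sadness": (0.0, 90.0, 200.0),
    "anger": (220.0, 30.0, 30.0),
}


def dominant_label(intensities: Dict[str, float]) -> str:
    """Return the emotion category with the highest intensity."""
    if not intensities:
        raise ValueError("no emotions present")
    return max(intensities, key=intensities.get)


def blend_color(intensities: Dict[str, float], palette: Dict[str, RGB]) -> RGB:
    """Intensity-weighted average of the base colors of all emotions present."""
    total = sum(intensities.values())
    if total <= 0:
        raise ValueError("intensities must sum to a positive value")
    r, g, b = (
        sum(weight * palette[emotion][channel]
            for emotion, weight in intensities.items()) / total
        for channel in range(3)
    )
    return (r, g, b)


if __name__ == "__main__":
    # Hypothetical annotation for one video segment.
    segment = {"joy": 0.7, "sadness": 0.2, "anger": 0.1}
    print(dominant_label(segment), blend_color(segment, PALETTE))
```

Because the color depends on every emotion present, two points with the same dominant label can still differ in hue, which is what makes gradients between categories visible in a scatter plot.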