AlignCap: Aligning Speech Emotion Captioning to Human Preferences

Ziqi Liang, Haoxiang Shi, Hanhui Chen

TL;DR

AlignCap aligns speech emotion captioning to human preferences using a large language model (LLM). It has two properties: Speech-Text Alignment, which minimizes the divergence between the LLM's response prediction distributions for speech and text inputs using knowledge distillation (KD) Regularization, and Human Preference Alignment, which uses Preference Optimization (PO) Regularization to eliminate factuality and faithfulness hallucinations.

Abstract

Speech Emotion Captioning (SEC) has gradually become an active research task. The emotional content conveyed through human speech is often complex, and classifying it into fixed categories may not be enough to fully capture speech emotions. Describing speech emotions through natural language may be a more effective approach. However, existing SEC methods often produce hallucinations and generalize poorly to unseen speech. To overcome these problems, we propose AlignCap, which aligns speech emotion captioning to human preferences using a large language model (LLM) and has two properties: 1) Speech-Text Alignment, which minimizes the divergence between the LLM's response prediction distributions for speech and text inputs using knowledge distillation (KD) Regularization. 2) Human Preference Alignment, where we design Preference Optimization (PO) Regularization to eliminate factuality and faithfulness hallucinations. We also extract emotional clues as a prompt for enriching fine-grained information under KD-Regularization. Experiments demonstrate that AlignCap outperforms other state-of-the-art methods on the zero-shot SEC task.
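
The Speech-Text Alignment idea, minimizing the divergence between the LLM's response prediction distributions for text and speech inputs, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy logits, the temperature parameter, and the choice of KL(text || speech) as the divergence (with the text-input distribution acting as teacher) are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn a list of logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kd_regularizer(text_logits, speech_logits, temperature=1.0):
    """Mean per-token KL(text || speech) over one response.

    Assumption: the text-input distribution is the teacher and the
    speech-input distribution is the student; the abstract does not
    state the direction of the divergence.
    """
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(text_logits, speech_logits)
    ]
    return sum(per_token) / len(per_token)


# Hypothetical next-token logits for a 2-token response over a 3-word vocabulary.
text_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
speech_logits = [[1.5, 0.7, -0.5], [0.0, 1.0, 0.8]]

loss = kd_regularizer(text_logits, speech_logits)
print(f"KD regularization term: {loss:.4f}")
```

The term is zero when the two inputs yield identical distributions and grows as they diverge, so adding it to a training loss would push the speech-conditioned predictions toward the text-conditioned ones.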

Paper Structure

This paper contains 16 sections, 6 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Hallucination and lack of generalization.
  • Figure 2: Results of different alignment methods.
  • Figure 3: t-SNE visualizations of the LLM's output from speech and text input. (a) Align before LLM Decoding. (b) Align after LLM Decoding.
  • Figure 4: The framework of AlignCap. Left: Illustration of Knowledge Distillation Regularization. Acoustic prompt $P_{\mathrm{act}}$ is generated from emotional clues, which are extracted by an emotion grammar parser $G_{\mathrm{parser}}$. Semantic prompt $P_{\mathrm{sem}}$ is generated from the LLM tokenizer. Right: Illustration of Preference Optimization Regularization.
  • Figure 5: Scoring prompt for candidate responses.
  • ...and 4 more figures