
BLSP-Emo: Towards Empathetic Large Speech-Language Models

Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang

TL;DR

BLSP-Emo presents an empathetic large speech-language model built by bootstrapping an instruction-following LLM with ASR data for semantic alignment and SER data for emotion alignment. The method uses a three-component architecture (speech encoder, LLM, and modality adapter) and a two-stage training pipeline that first aligns semantic content and then aligns emotion cues from speech to text generation. Across SER benchmarks, empathetic response tasks, and multi-turn conversations, BLSP-Emo achieves state-of-the-art or near-state-of-the-art results and demonstrates strong cross-language generalization, indicating effective integration of linguistic content and paralinguistic cues. While promising, the work relies on synthetic emotion-data exemplars for evaluation and uses a limited emotion taxonomy; future work should broaden paralinguistic cues and language coverage for real-world deployments.

Abstract

The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, the model likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generating empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment with the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and conversations.

Paper Structure

This paper contains 32 sections, 3 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Illustrative example of an empathetic large language model responding to spoken inputs with identical linguistic content but different emotional tones.
  • Figure 2: Overview of the BLSP-Emo approach. In the first step, an LLM generates emotion-aware text continuations using speech transcripts and emotion labels as inputs. These generated continuations serve as supervision to train the model in the second step, where the corresponding speech is used as input. Differences in the prompts used during data construction and the training stage are highlighted in red font.
  • Figure 3: Results on multi-turn conversation.
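
The two-step procedure described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the `<speech>` placeholder, the `SerExample` and `TrainingExample` types, and the `fake_llm` stand-in are all hypothetical names chosen for illustration. The sketch only shows the data flow: a text LLM sees the transcript and the emotion label and writes a continuation, and that continuation becomes the training target for a model that receives only the speech.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt templates. The actual wording used in the paper is not
# reproduced here; these only illustrate that the two prompts differ.
DATA_CONSTRUCTION_PROMPT = (
    "Continue the following sentence in a way that reflects a {emotion} tone: "
    "{transcript}"
)
TRAINING_PROMPT = (
    "Continue the following speech, taking the speaker's tone into account: "
    "<speech>"
)


@dataclass
class SerExample:
    """One item from an SER dataset: audio, transcript, and emotion label."""
    audio_path: str
    transcript: str
    emotion: str


@dataclass
class TrainingExample:
    """One emotion-alignment training item: speech in, continuation out."""
    audio_path: str
    prompt: str
    target: str


def build_construction_prompt(example: SerExample) -> str:
    # Step 1 input: the text LLM is given both the transcript and the label.
    return DATA_CONSTRUCTION_PROMPT.format(
        emotion=example.emotion, transcript=example.transcript
    )


def construct_training_example(
    example: SerExample, generate: Callable[[str], str]
) -> TrainingExample:
    # Step 1: generate an emotion-aware continuation from text inputs.
    continuation = generate(build_construction_prompt(example))
    # Step 2 input: the speech model gets the audio and a prompt that carries
    # neither the transcript nor the emotion label; the continuation is the
    # supervision target.
    return TrainingExample(
        audio_path=example.audio_path,
        prompt=TRAINING_PROMPT,
        target=continuation,
    )


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption): echoes the prompt so the
    # data flow can be inspected without any model.
    return "[continuation for: " + prompt + "]"


if __name__ == "__main__":
    item = SerExample("clip.wav", "I got the results today", "sad")
    print(construct_training_example(item, fake_llm))
```

Because the training prompt omits the emotion label, the only way for the speech model to reproduce an emotion-dependent target is to pick up the emotion from the audio itself, which is the intuition behind using the continuation task for emotion alignment.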