Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition

Weidong Chen, Xiaofen Xing, Peihao Chen, Xiangmin Xu

TL;DR

This work tackles the inefficiency and task-agnostic nature of large pretrained models for speech emotion recognition by proposing Vesper, a compact, task-specific encoder derived from WavLM Large via initialization compression and emotion-centered self-supervision. The method introduces an emotion-guided masking strategy, hierarchical self-supervision to capture acoustic and semantic cues, and cross-layer self-supervision to enrich the final representation. Empirical results on IEMOCAP, MELD, and CREMA-D show that Vesper-4 outperforms WavLM Base and Vesper-12 matches or closely approaches WavLM Large, with substantial reductions in model size and computation. Ablation studies confirm the contributions of each component and demonstrate the approach's robustness across downstream models and even across backbone PTMs like HuBERT, underscoring its practical impact for building task-specific, efficient SER systems.

Abstract

This paper presents a paradigm that adapts general large-scale pretrained models (PTMs) to the speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, so their efficacy on specific tasks can be further improved. Additionally, employing PTMs in practical applications can be challenging due to their considerable size. These limitations motivate another research direction, namely, optimizing large-scale PTMs for specific tasks to generate task-specific PTMs that are both compact and effective. In this paper, we focus on the speech emotion recognition task and propose an improved emotion-specific pretrained encoder called Vesper. Vesper is built on WavLM, pretrained on a speech dataset, and takes emotional characteristics into account. To enhance sensitivity to emotional information, Vesper employs an emotion-guided masking strategy to identify the regions that need masking. Vesper then employs hierarchical and cross-layer self-supervision to improve its ability to capture acoustic and semantic representations, both of which are crucial for emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and the performance of Vesper with 12 layers surpasses that of WavLM Large with 24 layers.
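The emotion-guided masking idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a per-frame score is already available that indicates how likely each frame is to carry emotional information; how that score is computed is not stated here, so `scores`, `num_masks`, and `span` are hypothetical names and inputs, and the selection rule below is an illustration rather than the authors' procedure.

```python
def emotion_guided_mask(scores, num_masks, span):
    """Mask `span` frames around the highest-scoring frames.

    `scores` is an assumed per-frame "emotional salience" value; this
    summary does not say how such a value would be derived.
    Returns (mask, centers), where mask[i] is True for masked frames.
    """
    n = len(scores)
    # Rank frame indices from highest to lowest score.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    centers = []
    for i in ranked:
        if len(centers) == num_masks:
            break
        # Skip candidates that sit too close to an already chosen center.
        if all(abs(i - c) >= span for c in centers):
            centers.append(i)
    mask = [False] * n
    half = span // 2
    for c in centers:
        start = max(0, c - half)
        end = min(n, start + span)
        for j in range(start, end):
            mask[j] = True
    return mask, sorted(centers)


# Toy example: two salient frames (indices 2 and 6) in an 8-frame utterance.
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.8, 0.2]
mask, centers = emotion_guided_mask(scores, num_masks=2, span=3)
```

In this toy input, low-scoring frames such as indices 0 and 4 stay unmasked, which is the contrast with uniformly random masking that the strategy is meant to provide.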

Paper Structure

This paper contains 33 sections, 14 equations, 6 figures, and 12 tables.

Figures (6)

  • Figure 1: (a) Compressing a large-scale pretrained model by knowledge distillation. (b) Adapting a large-scale pretrained model to a downstream task using labeled task-related data. (c) Our pipeline simultaneously applies compression and label-free adaptation to generate a task-specific pretrained model. KD, Comp., Init., and Obj. stand for knowledge distillation, compression, initialization, and objective, respectively. Circles of different colors represent models specialized for different spaces or tasks.
  • Figure 2: The proposed paradigm for generating a task-specific pretrained model that is both compact and effective based on a large-scale pretrained model. The paradigm consists of two steps: compression and task-specific pretraining.
  • Figure 3: Two types of compression approaches. The dashed line in (b) represents copying parameters from WavLM Large directly to Vesper. In this paper, we use approach (b) for initialization.
  • Figure 4: The task-specific self-supervised pretraining strategy of the proposed Vesper, which mainly consists of an emotion-guided masking strategy, hierarchical self-supervision ($L_l$ and $L_h$), and cross-layer self-supervision ($L_x$). Raw audio samples are used as inputs. $N$ and $M$ ($N < M$) denote the number of Transformer layers employed in Vesper and WavLM Large, respectively.
  • Figure 5: The emotion-guided masking strategy used for pretraining Vesper. Only regions with a high probability of containing emotional information are masked. The red triangle indicates the position of the mask center.
  • ...and 1 more figures
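
The initialization described in the Figure 3 caption, copying parameters from the M-layer WavLM Large into an N-layer Vesper with N < M, can be sketched as follows. Which source layers are copied is not stated in this summary, so the evenly spaced rule in `select_layer_indices` is an assumption, and the dictionaries stand in for real Transformer layers; all function names are hypothetical.

```python
import copy


def select_layer_indices(num_source_layers, num_target_layers):
    """Return evenly spaced source-layer indices (assumed selection rule)."""
    if not 0 < num_target_layers <= num_source_layers:
        raise ValueError("need 0 < N <= M")
    # Take the last layer of each of N roughly equal groups of source layers.
    return [((i + 1) * num_source_layers) // num_target_layers - 1
            for i in range(num_target_layers)]


def init_from_source(source_layers, num_target_layers):
    """Copy the selected source layers' parameters into a smaller stack."""
    indices = select_layer_indices(len(source_layers), num_target_layers)
    # Deep copy so that training the small model leaves the source untouched.
    return [copy.deepcopy(source_layers[i]) for i in indices]


# Toy stand-in for a 24-layer encoder: each "layer" is a dict of parameters.
source = [{"weight": [float(k)]} for k in range(24)]
small = init_from_source(source, 4)
```

With M = 24 and N = 4, this rule copies source layers 5, 11, 17, and 23 (zero-indexed). Other choices, such as taking the first N layers, would fit the caption equally well.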