Contrastive Distillation of Emotion Knowledge from LLMs for Zero-Shot Emotion Recognition

Minxue Niu, Emily Mower Provost

TL;DR

The study targets flexible emotion recognition across unseen label spaces, without extra annotation, by distilling rich GPT-4-generated emotion descriptors into a compact BERT-sized model. It uses a CLIP-inspired contrastive learning objective with an alignment matrix $\mathbf{M} = \mathbf{T} \mathbf{L}^{\top} / \tau$ and a sigmoid-based loss to align text and descriptor embeddings in a shared emotion space. Key contributions include (1) generating rich descriptive emotion annotations with GPT-4, (2) a multi-label contrastive distillation framework that enables zero-shot inference across varied label schemas, and (3) comprehensive evaluation on GoEmotions, SemEval, ISEAR, and EmoBank, along with ablations and emotion-space probing. The results show strong zero-shot performance: the model is competitive with GPT-4 on multi-label tasks and robust on valence regression, while remaining small enough for edge deployment. This work advances practical, adaptable ER systems that respect resource constraints and accommodate diverse downstream label schemas.
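The alignment matrix and sigmoid-based loss mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that the rows of $\mathbf{T}$ and $\mathbf{L}$ are L2-normalized text and descriptor embeddings, that the loss is a per-pair binary cross-entropy on the entries of $\mathbf{M}$, and that $\tau = 0.1$. The function name `sigmoid_alignment_loss` and the `targets` matrix are hypothetical.

```python
import numpy as np

def sigmoid_alignment_loss(text_emb, desc_emb, targets, tau=0.1):
    """Illustrative sigmoid loss over the alignment matrix M = T L^T / tau.

    text_emb: (n, d) text embeddings (T)
    desc_emb: (m, d) descriptor embeddings (L)
    targets:  (n, m) binary matrix, 1 where a descriptor applies to a text
    tau:      temperature (the value 0.1 is an assumption)
    """
    # Assumption: embeddings are L2-normalized before computing similarities.
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    L = desc_emb / np.linalg.norm(desc_emb, axis=1, keepdims=True)

    # Alignment matrix: temperature-scaled cosine similarities.
    M = T @ L.T / tau

    # Numerically stable binary cross-entropy with logits, one term per
    # (text, descriptor) pair, so several descriptors can match one text.
    per_pair = np.maximum(M, 0.0) - M * targets + np.log1p(np.exp(-np.abs(M)))
    return M, per_pair.mean()
```

Because every pair is scored independently, a text can be aligned with any number of descriptors, which is what makes a sigmoid formulation a natural fit for multi-label supervision.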

Abstract

The ability to handle various emotion labels without dedicated training is crucial for building adaptable Emotion Recognition (ER) systems. Conventional ER models rely on training using fixed label sets and struggle to generalize beyond them. On the other hand, Large Language Models (LLMs) have shown strong zero-shot ER performance across diverse label spaces, but their scale limits their use on edge devices. In this work, we propose a contrastive distillation framework that transfers rich emotional knowledge from LLMs into a compact model without the use of human annotations. We use GPT-4 to generate descriptive emotion annotations, offering rich supervision beyond fixed label sets. By aligning text samples with emotion descriptors in a shared embedding space, our method enables zero-shot prediction on different emotion classes, granularity, and label schema. The distilled model is effective across multiple datasets and label spaces, outperforming strong baselines of similar size and approaching GPT-4's zero-shot performance, while being over 10,000 times smaller.
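Zero-shot prediction in a shared embedding space can be sketched as follows. This is an illustration under stated assumptions, not the paper's inference procedure: the embeddings are taken as given (no encoder is shown), labels from the new label space are assumed to be embedded the same way as descriptors, and the temperature, the 0.5 threshold, and the function name `zero_shot_predict` are hypothetical choices.

```python
import numpy as np

def zero_shot_predict(text_emb, label_embs, label_names,
                      tau=0.1, threshold=0.5, multi_label=True):
    """Score one text against an arbitrary label set in the shared space.

    text_emb:    (d,) embedding of the input text
    label_embs:  (k, d) embeddings of the target label names or descriptions
    label_names: list of k label strings
    """
    # Cosine similarity between the text and each candidate label.
    t = text_emb / np.linalg.norm(text_emb)
    labels = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = labels @ t / tau

    if multi_label:
        # Multi-label: keep every label whose sigmoid score passes the threshold.
        probs = 1.0 / (1.0 + np.exp(-logits))
        return [name for name, p in zip(label_names, probs) if p >= threshold]

    # Single-label: return the best-aligned label.
    return label_names[int(np.argmax(logits))]
```

Swapping in a different label set only changes `label_embs` and `label_names`; nothing is retrained, which is the sense in which the prediction is zero-shot.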

Paper Structure

This paper contains 23 sections, 1 equation, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Our model is trained with rich emotion descriptors generated by GPT-4. During inference, this much smaller model can flexibly perform classification or regression on new label spaces.
  • Figure 2: Overview of our contrastive distillation model structure.