Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement

Zijin Su, Huanzhu Lyu, Yuren Niu, Yiming Liu

TL;DR

To address severe class imbalance in multi-label sentiment classification on the GoEmotions dataset, the work builds a balanced corpus by augmenting the GoEmotions data with Sentiment140 samples labeled by a RoBERTa-go-emotions classifier and 20k texts generated by GPT-4 mini. It introduces a unified CNN–BiLSTM–attention architecture that uses pre-trained FastText embeddings and a sigmoid multi-label output, with mixed-precision training and per-label thresholds for 28 emotion categories. The key contributions are (i) a data-balancing pipeline that improves minority-emotion recall and F1, and (ii) a lightweight architecture that rivals transformer baselines while reducing compute. The results show improved accuracy, precision, recall, F1-score, and AUC, and suggest practical applicability for fine-grained sentiment monitoring in real-world settings.
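
To make the "sigmoid multi-label output with per-label thresholds" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the three example labels, and the threshold values are all hypothetical and chosen only for illustration. The point is that each label is decided independently, so a text can receive zero, one, or several emotions.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_labels(logits, thresholds, label_names):
    """Return every label whose sigmoid probability meets its own threshold.

    Each label is scored independently (no softmax across labels), which is
    what allows several emotions to be predicted for one text.
    """
    if not (len(logits) == len(thresholds) == len(label_names)):
        raise ValueError("logits, thresholds and label_names must align")
    probs = [sigmoid(v) for v in logits]
    return [name for name, p, t in zip(label_names, probs, thresholds) if p >= t]


# Hypothetical three-label example; the paper works with 28 categories.
labels = ["joy", "anger", "grief"]
logits = [2.0, -1.0, -0.5]       # assumed raw scores from a model's output layer
thresholds = [0.5, 0.5, 0.3]     # assumed per-label cut-offs (lower for a rare label)
predicted = predict_labels(logits, thresholds, labels)
```

With a single global threshold of 0.5, "grief" would be dropped in this toy case; a lower per-label threshold is one way a rare emotion can still be recovered.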

Abstract

Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, Sentiment140 samples labeled with emotions by a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, a bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed-precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.
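
The abstract does not spell out how the even distribution is reached, so the following is only a rough sketch of one way to plan such balancing: count how often each label occurs and compute how many additional samples each label would need to match the most frequent one. The helper names and the toy data are hypothetical, not taken from the paper.

```python
from collections import Counter


def label_counts(samples):
    """Count how many samples carry each label (a sample may carry several)."""
    counts = Counter()
    for labels in samples:
        counts.update(set(labels))  # set() guards against duplicated labels
    return counts


def augmentation_deficits(samples, all_labels, target=None):
    """Number of extra samples each label needs to reach `target`.

    If no target is given, the count of the most frequent label is used
    (an assumption made for this sketch).
    """
    counts = label_counts(samples)
    if target is None:
        target = max((counts[label] for label in all_labels), default=0)
    return {label: max(0, target - counts[label]) for label in all_labels}


# Toy multi-label data: each inner list is the label set of one text.
toy = [["joy"], ["joy", "admiration"], ["joy"], ["grief"], ["admiration"]]
deficits = augmentation_deficits(toy, ["joy", "admiration", "grief"])
```

Because one added text can carry several labels, the deficits are upper bounds on the number of new texts needed rather than an exact budget.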

Paper Structure

This paper contains 21 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Unbalanced 'GoEmotions' dataset: the original dataset before balancing.
  • Figure 2: Balanced 'GoEmotions' dataset: the dataset after balancing.
  • Figure 3: The 50 most frequent words in the dataset.
  • Figure 4: Cosine similarity heat map of emotion labels in the word embedding space.
  • Figure 5: Step-by-step workflow of the whole experiment.
  • ...and 7 more figures