Self-training Strategies for Sentiment Analysis: An Empirical Study

Haochen Liu; Sai Krishna Rallabandi; Yijing Wu; Parag Pravin Dakle; Preethi Raghavan

TL;DR

This work tackles how unlabeled data can be exploited for sentiment analysis when labeled data are scarce, by empirically comparing self-training strategies for small language models (SLMs) and by integrating large language models (LLMs) in subject and object modes. It evaluates three instance-selection families—threshold-based, max/min-based, and soft-label—and analyzes how hyperparameters interact with $n$-shot settings across the LDC, MOSEI, and Financial PhraseBank datasets, using RoBERTa-base as the backbone. In addition, it investigates LLM-assisted self-training with Flan-UL2 and GPT-4, showing strong zero-shot performance for GPT-4 on open-domain data and domain-dependent gains from LLM prompts and data augmentation in the Financial PhraseBank. Overall, the study provides practical guidance on when and how to apply self-training strategies and LLM assistance to build robust sentiment classifiers under limited labeled data, highlighting the importance of data quality and task/domain considerations.

Abstract

Sentiment analysis is a crucial task in natural language processing that involves identifying and extracting subjective sentiment from text. Self-training has recently emerged as an economical and efficient technique for developing sentiment analysis models by leveraging a small amount of labeled data and a large amount of unlabeled data. However, for a given set of training data, the way the data are used during self-training makes a significant difference to the final performance of the model. We refer to this choice as the self-training strategy. In this paper, we present an empirical study of various self-training strategies for sentiment analysis. First, we investigate the influence of the self-training strategy and hyper-parameters on the performance of traditional small language models (SLMs) in various few-shot settings. Second, we explore the feasibility of leveraging large language models (LLMs) to help self-training, proposing and empirically comparing several self-training strategies that involve LLMs. Extensive experiments are conducted on three real-world sentiment analysis datasets.
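To make the notion of a self-training strategy concrete, the following is a minimal sketch of one plausible reading of threshold-based instance selection: an unlabeled instance is pseudo-labeled and added to the training set only if the model's top class probability reaches a threshold. The function name `select_by_threshold`, the three-class label order, and the example probabilities are all hypothetical and are not taken from the paper, whose exact selection rule may differ.

```python
from typing import List, Sequence, Tuple


def select_by_threshold(
    probs: Sequence[Sequence[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Return (instance_index, pseudo_label) pairs for instances whose
    highest class probability is at least `threshold`.

    This is an illustrative assumption about threshold-based selection,
    not the paper's confirmed procedure.
    """
    selected = []
    for i, p in enumerate(probs):
        # The pseudo-label is the class with the highest predicted probability.
        label = max(range(len(p)), key=lambda c: p[c])
        if p[label] >= threshold:
            selected.append((i, label))
    return selected


# Hypothetical class probabilities for four unlabeled sentences,
# ordered as (negative, neutral, positive).
example_probs = [
    [0.05, 0.05, 0.90],
    [0.40, 0.35, 0.25],
    [0.92, 0.05, 0.03],
    [0.10, 0.60, 0.30],
]

picked = select_by_threshold(example_probs, threshold=0.8)
```

In this sketch, raising the threshold admits fewer but more confident pseudo-labels, while lowering it admits more instances at the risk of noisier labels. This is the kind of trade-off a threshold sweep would expose.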

Paper Structure (26 sections, 3 figures, 5 tables, 1 algorithm)

This paper contains 26 sections, 3 figures, 5 tables, 1 algorithm.

Introduction
Related Works
Self-training with SLMs
The Base Model
General Self-training Procedure
Instance Selection Strategies
Threshold-based
Max/Min-based
Soft Label
Experiments I: SLMs
Datasets
The LDC Dataset
The MOSEI Dataset
The Financial PhraseBank Dataset
Data Distributions
...and 11 more sections

Figures (3)

Figure 1: The x-axis indicates the threshold; the yellow bars represent the final number of unlabeled instances added to the training set; the blue line indicates the accuracy of inferring unlabeled instances; the red line indicates the F1 score of the well-trained model on the test set.
Figure 2: The prompts used for querying LLMs in the zero-shot and few-shot settings.
Figure 3: The performances of GPT-4 with the Obj-Conf-Score strategy. The x-axis indicates the thresholds of the confidence scores; the yellow bars represent the number of inferred instances selected for training; the blue line indicates the accuracy of the LLM inferring unlabeled instances; the red line indicates the F1 score of the well-trained SLM on the test set.
