Predicting Emotion Intensity in Polish Political Texts: Comparing Supervised Models and Large Language Models in a Resource-Poor Language

Hubert Plisiecki, Piotr Koc, Maria Flakus, Artur Pokropek

TL;DR

This paper addresses the prediction of emotion intensity in Polish political texts, a resource-poor language setting. The authors build an annotated corpus of 10,000 texts and compare a supervised regression model against large language models (GPT-3.5 and GPT-4) used with in-context prompts, and they include a cost analysis. The supervised model generally achieves higher accuracy and lower variability than the LLMs, though LLMs remain viable when annotation is expensive or unavailable. The work highlights the practical trade-off between data labeling effort and model performance, and the authors share code and pretrained models to support replication and extension to other languages and continuous emotion features.

Abstract

This study explores the use of large language models (LLMs) to predict emotion intensity in Polish political texts, a resource-poor language context. The research compares the performance of several LLMs against a supervised model trained on an annotated corpus of 10,000 social media texts whose emotion intensity was rated by expert judges. The findings indicate that the supervised model generally outperforms the LLMs, offering higher accuracy and lower variance, but that LLMs present a viable alternative, especially given the high cost of data annotation. The study highlights the potential of LLMs in low-resource language settings and underscores the need for further research on emotion intensity prediction and its application across different languages and continuous features. For researchers and practitioners, the results suggest that the choice of approach to emotion prediction should depend on resource availability and the specific requirements of the task.

Paper Structure (26 sections, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Histograms of three sets of annotations: those made by the original raters and those generated by GPT-3.5 and GPT-4. To allow a direct comparison of the distributions, the histogram for the original raters uses the individual annotator labels before averaging. Because each text was labeled by exactly 5 annotators, these labels were scaled by dividing by 5 to make them comparable to the labels generated by the LLMs.