Nostra Domina at EvaLatin 2024: Improving Latin Polarity Detection through Data Augmentation

Stephen Bothwell; Abigail Swenor; David Chiang

TL;DR

This work tackles emotion polarity detection in Latin poetry under heavy data scarcity by introducing two clustering-based data augmentation methods: Polarity Coordinate Clustering and Gaussian Clustering. It pairs these with a neural architecture that draws on embeddings from several Latin LLMs, with a focus on PhilBERTa, feeds them to either a Transformer or a BiLSTM encoder, and tunes the setup through a thorough hyperparameter search. The Gaussian-annotated data, combined with PhilBERTa embeddings and a Transformer encoder, achieves the second-best Macro-F1 on EvaLatin 2024, which points to the effectiveness of distribution-aware automatic labeling in low-resource settings. The results also show that the Gaussian annotator yields a more balanced label distribution, and they suggest that further gains could come from noise-tolerant training akin to expectation-maximization.
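For readers unfamiliar with the metric named above, the following is a minimal sketch of how Macro-F1 is commonly computed: the F1 score of each class is calculated separately and the per-class scores are averaged without weighting by class frequency. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the shared task's scoring script.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels (not shared-task data).
gold = ["positive", "positive", "negative", "negative"]
pred = ["positive", "negative", "negative", "negative"]
score = macro_f1(gold, pred)
```

Because every class contributes equally to the average, a system that neglects a rare class is penalized more heavily than it would be under plain accuracy, which is one reason a balanced label distribution in the augmented data matters.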

Abstract

This paper describes submissions from the team Nostra Domina to the EvaLatin 2024 shared task of emotion polarity detection. Given the low-resource environment of Latin and the complexity of sentiment in rhetorical genres like poetry, we augmented the available data through automatic polarity annotation. We present two methods for doing so on the basis of the $k$-means algorithm, and we employ a variety of Latin large language models (LLMs) in a neural architecture to better capture the underlying contextual sentiment representations. Our best approach achieved the second-highest macro-averaged Macro-$F_1$ score on the shared task's test set.
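As a rough illustration of what $k$-means-based automatic annotation can look like, the sketch below runs plain Lloyd's $k$-means on two-dimensional points and then labels each point with the class attached to its nearest centroid. It is a generic sketch under stated assumptions, not the paper's Polarity Coordinate or Gaussian Clustering procedure: the (polarity, intensity) coordinates follow the axes described for Figure 1, but the way sentences are mapped to coordinates, the initial centroids, the class names, and the helper names `kmeans` and `annotate` are all hypothetical.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's k-means, starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        updated = []
        for c, members in zip(centroids, clusters):
            if members:
                # New centroid is the coordinate-wise mean of the members.
                updated.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                updated.append(c)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return centroids


def annotate(points, centroids, labels):
    """Give each point the label attached to its nearest centroid."""
    return [labels[nearest(p, centroids)] for p in points]


# Made-up (polarity, intensity) coordinates; not data from the paper.
points = [(-1.0, 1.0), (-0.8, 0.9), (1.0, 1.0), (0.9, 0.8)]
initial = [(-0.5, 0.5), (0.5, 0.5)]          # hypothetical starting centroids
class_names = ["negative", "positive"]       # hypothetical label per centroid
fitted = kmeans(points, initial)
auto_labels = annotate(points, fitted, class_names)
```

Tying each initial centroid to a known class is one simple way to turn unsupervised clusters into polarity labels; the labels produced this way are noisy, which is why the augmented data is best treated as weak supervision.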

Paper Structure (15 sections, 2 equations, 4 figures, 3 tables)

Introduction
Data
Automatic Annotation
Polarity Coordinate (PC) Clustering
Gaussian Clustering
Annotation Results
Modeling
Experiments
Experimental Design
Hyperparameter Search
Results
Conclusion
Acknowledgements
Bibliographical References
Language Resource References

Figures (4)

Figure 1: The polarity coordinate plane. Points are all colored differently to represent their classes and are labeled accordingly. The $x$-axis and $y$-axis represent polarity and intensity, respectively.
Figure 2: Architectural options fixed across hyperparameter search trials. Shapes reflect the relative dimensionality of data throughout the network.
Figure 3: Ranks and reported Macro-F1 score averages for our EvaLatin 2024 shared task submissions. The left and right tables are for the first and second submissions, respectively. Ranks range between 1 and 4, not accounting for the baseline. When a tie occurs, the best possible ranking is displayed.
Figure 4: Confusion matrices for our best-performing submission. The left matrix is for the whole EvaLatin 2024 test set, whereas the right matrix is for the Pontano subset. Darker colors indicate larger values on the heatmap; text colors are shifted for readability.
