Nostra Domina at EvaLatin 2024: Improving Latin Polarity Detection through Data Augmentation
Stephen Bothwell, Abigail Swenor, David Chiang
TL;DR
This work tackles emotion polarity detection in Latin poetry under heavy data scarcity by introducing two clustering-based data augmentation methods: Polarity Coordinate Clustering and Gaussian Clustering. It pairs these with a neural architecture that draws on a range of Latin LLM embeddings, with a focus on PhilBERTa, fed into Transformer or BiLSTM encoders, and it conducts a thorough hyperparameter search. The Gaussian-annotated data, combined with PhilBERTa embeddings and a Transformer encoder, achieves the second-best macro-$F_1$ on EvaLatin 2024, highlighting the effectiveness of distribution-aware automatic labeling in low-resource settings. The results also show that the Gaussian annotator yields a more balanced label distribution, and they suggest that further improvements could come from noise-tolerant training akin to expectation-maximization.
Abstract
This paper describes submissions from the team Nostra Domina to the EvaLatin 2024 shared task of emotion polarity detection. Given the low-resource environment of Latin and the complexity of sentiment in rhetorical genres like poetry, we augmented the available data through automatic polarity annotation. We present two methods for doing so on the basis of the $k$-means algorithm, and we employ a variety of Latin large language models (LLMs) in a neural architecture to better capture the underlying contextual sentiment representations. Our best approach achieved the second-highest macro-averaged $F_1$ score on the shared task's test set.
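
For readers unfamiliar with the evaluation metric, the macro-averaged $F_1$ score is the unweighted mean of the per-class $F_1$ scores, so every polarity class counts equally regardless of its frequency. The following is a minimal sketch of that standard definition, not the shared task's official scorer; the function name `macro_f1` and the label strings in the usage note are illustrative assumptions.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[g] += 1      # g was the true class but was missed

    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

As a usage example with made-up labels, `macro_f1(["positive", "negative", "neutral", "positive"], ["positive", "negative", "positive", "positive"])` gives per-class scores of 1.0 (negative), 0.0 (neutral), and 0.8 (positive). A class that is never predicted correctly pulls the average down sharply, which is why a balanced label distribution in augmented data matters for this metric.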
