Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts

Thibault Clérice

TL;DR

This paper tackles the challenge of building Latin corpora for semantic analysis by evaluating sentence-level classification with deep learning. It introduces the Lasciva Roma Dataset with 2,516 sentences spanning 300 BCE to 900 CE and benchmarks multiple encoders and embeddings, finding that Hierarchical Attention Networks (HAN) with lemma embeddings yield strong discrimination (precision 70.60%, TPR 86.33%), while metadata-derived features can overfit. The study demonstrates that larger training sets improve performance and that attention mechanisms offer interpretable signals to assist human raters in filtering results. It also shows that, in limited-data or low-resource settings, BERT does not always outperform traditional RNN-based approaches, highlighting the continued relevance of HAN and lemma-based representations for Latin text processing. Overall, the work provides a practical, human-centered pathway for accelerating corpus construction in the humanities through interpretable deep learning models and openly available code and data.

Abstract

In this study, we evaluate deep learning methods for sentence-level semantic classification as a way to accelerate corpus building in the humanities and linguistics, a traditionally time-consuming task. We introduce a novel corpus of around 2500 sentences, spanning 300 BCE to 900 CE, that contain sexual semantics (medical, erotica, etc.). We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (centuries, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach: using HAN, we achieve a precision of 70.60% and a true positive rate (TPR) of 86.33%. We evaluate the impact of dataset size on model performance (420 instead of 2013), and show that, while our models perform worse, they still offer a high enough precision and TPR, even without MLM, respectively 69% and 51%. Given these results, we provide an analysis of the attention mechanism as a supporting added value for humanists in order to produce more data.
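
As a reminder of how the two reported metrics are defined, the sketch below computes precision (TP / (TP + FP)) and true positive rate (TP / (TP + FN)) for a binary sentence classifier. It is a minimal illustration, not the paper's evaluation code: the function name `precision_and_tpr` and the toy labels are assumptions introduced here, and the numbers are not taken from the dataset.

```python
from typing import Sequence, Tuple


def precision_and_tpr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, TPR) for binary labels, where 1 marks a positive sentence.

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged sentences
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged but not relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant but missed
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, tpr


# Toy labels (assumed, not from the corpus): 4 relevant sentences, 4 irrelevant ones.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

precision, tpr = precision_and_tpr(y_true, y_pred)
print(f"precision={precision:.2%}  TPR={tpr:.2%}")
```

In a corpus-building workflow, precision indicates how much of the model's output a human rater must discard, while TPR indicates how many relevant sentences the model recovers; this is why the abstract reports both.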

Paper Structure (30 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Structure of the general architecture. Sentence categorical features are (a) used in the encoder, (b) used in the linear layer, or (c) not used.
  • Figure 2: Concatenation of features to represent word-level information.
  • Figure 3: Percentage of the corpus (in number of words) per 50-year increment.
  • Figure 4: Percentage of the corpus (in number of examples) per 50-year increment, by type of example. Metonymies are counted as non-figurative.
  • Figure 5: Relative rank distribution of the sexual terms in TP and FN.