Training BERT Models to Carry Over a Coding System Developed on One Corpus to Another

Dalma Galambos; Pál Zsámboki

TL;DR

This work addresses transferring a domain-specific coding system for literary translation, made up of multilabel content codes and multiclass context codes, from one Hungarian literary journal to another using BERT-based models. It combines domain-adaptation pretraining, finetuning with loss functions robust to label imbalance, and 10-fold crossvalidation to build ensembles, and it evaluates resistance to domain shift on a test set sampled from the target domain, using bootstrapped confidence intervals. The study shows that domain adaptation on OCR-ed text from another domain improves performance almost as much as domain adaptation on the corpus under study, and that content-label transfer remains reliable across domains; a qualitative analysis offers insight into the remaining misclassifications. The results advance understanding of cross-domain annotation transfer in literary studies and illustrate practical methods for quantifying transfer confidence and for comparing a diverse set of baselines and model variants.

Abstract

This paper describes how we train BERT models to carry over a coding system developed on the paragraphs of one Hungarian literary journal to another. The aim of the coding system is to track trends in the perception of literary translation around the political transformation of 1989 in Hungary. We use 10-fold crossvalidation to evaluate not only task performance but also the consistency of the annotation, and to obtain better predictions from an ensemble. We use extensive hyperparameter tuning to obtain the best possible results and fair comparisons. To handle label imbalance, we use loss functions and metrics that are robust to it. We evaluate the effect of domain shift by sampling a test set from the target domain, and we establish the sample size by estimating the bootstrapped confidence interval via simulations. This way, we show that our models can carry over one annotation system to the target domain. Comparisons provide further insights: learning multilabel correlations and applying a confidence penalty improve resistance to domain shift, and domain adaptation on OCR-ed text from another domain improves performance almost as much as domain adaptation on the corpus under study. See our code at https://codeberg.org/zsamboki/bert-annotator-ensemble.
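The idea of choosing a test-set size by simulating bootstrapped confidence intervals can be illustrated with a minimal sketch. It is not the authors' implementation: the metric (plain accuracy), the assumed accuracy of 0.8, the candidate sample sizes, and the function names `bootstrap_ci_width` and `simulate_widths` are all assumptions made for illustration.

```python
import random


def bootstrap_ci_width(outcomes, n_resamples=1000, alpha=0.05, rng=None):
    """Width of the percentile bootstrap CI for accuracy over 0/1 outcomes."""
    rng = rng or random.Random(0)
    n = len(outcomes)
    stats = []
    for _ in range(n_resamples):
        # Resample the outcomes with replacement and recompute accuracy.
        resample = rng.choices(outcomes, k=n)
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return upper - lower


def simulate_widths(assumed_accuracy, sample_sizes, seed=0):
    """Simulate a test set of each size and measure its bootstrap CI width."""
    rng = random.Random(seed)
    widths = {}
    for n in sample_sizes:
        # Hypothetical per-example correctness, drawn at the assumed accuracy.
        outcomes = [1 if rng.random() < assumed_accuracy else 0 for _ in range(n)]
        widths[n] = bootstrap_ci_width(outcomes, rng=rng)
    return widths


# Assumed values, for illustration only.
widths = simulate_widths(assumed_accuracy=0.8, sample_sizes=[50, 200, 800])
for n, width in widths.items():
    print(f"n={n}: 95% CI width ~ {width:.3f}")
```

Under these assumptions, one would pick the smallest sample size whose simulated interval is narrow enough for the comparisons of interest, trading annotation effort against precision.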

Paper Structure

This paper contains 52 sections, 2 figures, and 2 tables.

Introduction
Objective of the Large Pilot Project Providing the Broader Context of the Present Paper
Scope of the Present Paper, Main Contributions
Related Work
Dataset
Corpus: Alföld and Nagyvilág, Two Hungarian Literary Journals from the Period under Examination
Manual Annotation of Alföld
Preprocessing Pipeline: from Page Scans to Paragraph Texts
Dataset Statistics and Further Transformations
Paragraph and Word Counts
Pruning Alföld for the Finetuning Set
Label Statistics
10-fold Stratification
Pruning and Truncating Paragraphs from Nagyvilág for the Target Domain
Training
...and 37 more sections

Figures (2)

Figure 1: Dataset statistics and evaluation results. (a) Content label counts. (b) Context label counts. The context label with index 0 refers to paragraphs that contain the subword "fordí" but are unrelated to translation. (c) Content label correlations expressed as conditional probabilities. (d) Content label evaluation results by label (ROC AUC). (e) Context label evaluation results by label (accuracy). In (d) and (e), 10-fold crossvalidation results are dark blue, and test set results are orange.
Figure 2: Bootstrap simulation of how increasing the sample size decreases the width of the confidence interval.
