Automatic register identification for the open web using multilingual deep learning

Erik Henriksson; Amanda Myntti; Saara Hellström; Anni Eskelinen; Selcen Erten-Johansson; Veronika Laippala

TL;DR

The study tackles automatic register identification over the open web in 16 languages by introducing the Multilingual CORE corpora with a 25-class CORE taxonomy and evaluating transformer-based multilingual classifiers. It demonstrates that hierarchical multi-label training yields strong performance (around 77–79% micro F1) and robust cross-lingual transfer, while data pruning can push results beyond 90% F1 on cleaner data. The work compares CORE with the simpler X-GENRE scheme, showing comparable performance but far richer granularity, and analyzes challenges posed by hybrids and labeling uncertainty. Practically, the approach supports improved metadata for web-scale NLP and more controlled, diverse training data for language technologies, especially benefiting low-resource languages, though zero-shot transfer remains imperfect. The findings offer guidance for data curation, model choice, and interpretability in multilingual web-register classification.

Abstract

This article presents multilingual deep learning models for identifying web registers (text varieties such as news reports and discussion forums) across 16 languages. We introduce the Multilingual CORE corpora, which contain over 72,000 documents annotated with a hierarchical taxonomy of 25 registers designed to cover the entire open web. Using multi-label classification, our best model achieves 79% F1 averaged across languages, matching or exceeding previous studies that used simpler classification schemes. This demonstrates that models can perform well even with a complex register scheme at multilingual scale. However, we observe a consistent performance ceiling across all models and configurations. When we remove documents with uncertain labels through data pruning, performance increases to over 90% F1, suggesting that this ceiling stems from inherent ambiguity in web registers rather than model limitations. Analysis of hybrid texts (those combining multiple registers) reveals that the main challenge lies not in classifying hybrids themselves, but in distinguishing hybrid from non-hybrid documents. Multilingual models consistently outperform monolingual ones, particularly for languages with limited training data. Zero-shot performance on unseen languages drops by an average of 7%, though this varies by language (3–8%), indicating that while registers share features across languages, they also retain language-specific characteristics.
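
To make the reported metric concrete, the following is a minimal sketch of micro-averaged F1 for multi-label predictions, where a hybrid document carries more than one register label. The function name `micro_f1`, the label names, and the four example documents are hypothetical illustrations and are not taken from the corpora; the paper's own evaluation code may differ.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label documents.

    gold, pred: lists of label sets, one set per document.
    Counts are pooled over all documents and labels before
    computing F1, so frequent labels weigh more than rare ones.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical register labels, for illustration only.
gold = [{"news"}, {"forum"}, {"news", "opinion"}, {"howto"}]
pred = [{"news"}, {"forum", "opinion"}, {"news"}, {"howto"}]

score = micro_f1(gold, pred)
print(f"micro F1 = {score:.2f}")
```

In this toy example both errors come from the hybrid boundary: the second document is predicted as a hybrid although it has a single gold label, and the third is a gold hybrid predicted with a single label. This is the kind of confusion the abstract identifies as the main difficulty with hybrid texts.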

Paper Structure (40 sections, 10 figures, 16 tables)

This paper contains 40 sections, 10 figures, 16 tables.

Introduction
Background
Terminology: register vs. genre
Challenges in web register identification
Methodological developments
Towards multilingual CORE
Materials
Web register datasets
Corpus compilation and annotation strategies
Register taxonomy and distribution
CORE conversion to X-GENRE
Methods
Data splits
Label encoding
Language imbalance
...and 25 more sections

Figures (10)

Figure 1: Distribution of text lengths across the Multilingual CORE Corpora, measured in SentencePiece tokens. Documents longer than 8,192 tokens (2% of the dataset) are not shown for clarity.
Figure 2: Main registers (right) and subregisters (left) with their distribution across the large corpora. The Other categories include documents that could not be assigned to a subregister. In the classification process, these documents are assigned only the main label(s).
Figure 3: Percentage heatmap of class co-occurrences across 16 languages. Totals (in parentheses) show overall class occurrences. Diagonal cells display percentages of singly appearing labels, and off-diagonal cells show co-occurrence (i.e. hybrid) proportions.
Figure 4: Mapping from CORE main registers (left) and subregisters (middle) to X-GENRE (right).
Figure 5: Data sampling process for multilingual fine-tuning, illustrating the repeated random selection and automatic replenishment of language datasets.
...and 5 more figures
