Automatic register identification for the open web using multilingual deep learning
Erik Henriksson, Amanda Myntti, Saara Hellström, Anni Eskelinen, Selcen Erten-Johansson, Veronika Laippala
TL;DR
The study tackles automatic register identification over the open web in 16 languages by introducing the Multilingual CORE corpora with a 25-class CORE taxonomy and evaluating transformer-based multilingual classifiers. It demonstrates that hierarchical multi-label training yields strong performance (around 77–79% micro F1) and robust cross-lingual transfer, while data pruning can push results beyond 90% F1 on cleaner data. The work compares CORE with the simpler X-GENRE scheme, showing comparable performance but far richer granularity, and analyzes challenges posed by hybrids and labeling uncertainty. Practically, the approach supports improved metadata for web-scale NLP and more controlled, diverse training data for language technologies, especially benefiting low-resource languages, though zero-shot transfer remains imperfect. The findings offer guidance for data curation, model choice, and interpretability in multilingual web-register classification.
Abstract
This article presents multilingual deep learning models for identifying web registers (text varieties such as news reports and discussion forums) across 16 languages. We introduce the Multilingual CORE corpora, which contain over 72,000 documents annotated with a hierarchical taxonomy of 25 registers designed to cover the entire open web. Using multi-label classification, our best model achieves 79% F1 averaged across languages, matching or exceeding previous studies that used simpler classification schemes. This demonstrates that models can perform well even with a complex register scheme at multilingual scale. However, we observe a consistent performance ceiling across all models and configurations. When we remove documents with uncertain labels through data pruning, performance increases to over 90% F1, suggesting that this ceiling stems from inherent ambiguity in web registers rather than model limitations. Analysis of hybrid texts (those combining multiple registers) reveals that the main challenge lies not in classifying hybrids themselves, but in distinguishing hybrid from non-hybrid documents. Multilingual models consistently outperform monolingual ones, particularly for languages with limited training data. Zero-shot performance on unseen languages drops by an average of 7%, though this varies by language (3–8%), indicating that while registers share features across languages, they also retain language-specific characteristics.
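To make the evaluation setting concrete, the following is a minimal sketch of how micro-averaged F1 could be computed for hierarchical multi-label register predictions. It is an illustration under stated assumptions, not the authors' evaluation code: the label names, the `PARENT` map, and the helper functions `with_parents` and `micro_f1` are hypothetical, and the rule that a sub-register implies its parent register is assumed here rather than taken from the abstract.

```python
from typing import Dict, List, Set

# Hypothetical child -> parent map (illustrative only, not the full 25-class taxonomy).
PARENT: Dict[str, str] = {
    "news_report": "narrative",
    "discussion_forum": "interactive_discussion",
}


def with_parents(labels: Set[str]) -> Set[str]:
    """Return the label set extended with the parent of every sub-register label."""
    return labels | {PARENT[label] for label in labels if label in PARENT}


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool label-level TP, FP and FN over all documents."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = with_parents(g), with_parents(p)
        tp += len(g & p)   # labels both annotated and predicted
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (invented for illustration; not drawn from the corpora).
gold = [{"news_report"}, {"discussion_forum"}, {"opinion"}]
pred = [{"news_report"}, {"interactive_discussion"}, {"narrative"}]
score = micro_f1(gold, pred)
```

In this toy example the second prediction names only the parent register, so it earns partial credit: one label matches and the sub-register counts as a miss. Micro-averaging pools such label-level counts across documents, which means frequent registers weigh more heavily in the final score than rare ones.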
