Multilingual Attribute Extraction from News Web Pages
Pavel Bedrin, Maksim Varlamov, Alexander Yatskov
TL;DR
The paper tackles multilingual extraction of news article attributes from web pages by expanding a Russian dataset to six languages and comparing English-pretrained MarkupLM with a newly trained multilingual DOM-LM. Through careful dataset construction, DOM-tree and textual feature engineering, and diverse evaluation settings, the authors show that translating pages to English can boost MarkupLM performance, but a multilingual DOM-LM achieves competitive or superior results without translation in most scenarios. The work also benchmarks open-source tools and Zyte datasets, highlighting gaps in non-English attribute extraction and establishing a practical approach for multilingual web data extraction. Overall, the multilingual DOM-LM demonstrates strong cross-language generalization and practical utility for multilingual news analysis and aggregation, with room for improvement via more balanced pre-training data and broader labeling across languages.
Abstract
This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages. Recent neural network models have shown high efficacy in extracting information from semi-structured web pages. However, these models are predominantly applied to domains like e-commerce and are pre-trained using English data, complicating their application to web pages in other languages. We prepared a multilingual dataset comprising 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic) from 161 websites. The dataset is publicly available on GitHub. We fine-tuned the pre-trained state-of-the-art model, MarkupLM, to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality. Additionally, we pre-trained another state-of-the-art model, DOM-LM, on multilingual data and fine-tuned it on our dataset. We compared both fine-tuned models to existing open-source news data extraction tools; both achieved superior extraction metrics.
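
To make the task setup concrete: markup-aware models such as MarkupLM consume the text nodes of a page together with each node's XPath in the DOM tree, and attribute extraction then amounts to labeling those nodes. The sketch below is not the authors' pipeline; it is a minimal illustration, using only the Python standard library, of how a page could be flattened into (XPath, text) pairs. The class name `TextNodeCollector`, the function `extract_text_nodes`, and the sample page are hypothetical, and the handling of malformed HTML is deliberately simplistic.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextNodeCollector(HTMLParser):
    """Collect (xpath, text) pairs for every non-empty text node (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.stack = []            # (tag, sibling index) for each open element
        self.child_counts = [{}]   # per-depth counters of child tags seen so far
        self.nodes = []            # collected (xpath, text) pairs

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, counts[tag]))
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open element;
        # real-world pages would need more robust recovery than this.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            xpath = "".join(f"/{t}[{i}]" for t, i in self.stack)
            self.nodes.append((xpath, text))


def extract_text_nodes(html):
    """Return a list of (xpath, text) pairs in document order."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Hypothetical sample page; a real news page would be far noisier.
sample = ("<html><body><h1>Title</h1><p>By Jane</p>"
          "<p>Body<br>text</p></body></html>")
for xpath, text in extract_text_nodes(sample):
    print(xpath, "->", text)
```

A node classifier would then assign each pair either an attribute label or a "none" label. Because the XPath part is language-independent, only the text side of each pair depends on the page language, which is where translation to English or multilingual pre-training comes into play.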
