DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling

Mariia Fedorova, Andrey Kutuzov, Khonzoda Umarova

TL;DR

DHPLT tackles the scarcity of multilingual diachronic corpora by releasing open, large-scale three-period corpora for 41 languages derived from HPLT v3.0, with crawl timestamps serving as the temporal signal. It couples these corpora with precomputed target-word representations, including contextualized embeddings from T5 and XLM-R, lexical substitutions from GPT-BERT and XLM-R, and aligned static embeddings with frequency counts, enabling immediate multilingual lexical semantic change detection (LSCD) experiments. The resource facilitates cross-language semantic-change studies beyond high-resource languages and supports long-horizon analyses across multiple time points. Sanity checks on representative terms such as AI/IA show plausible, time-consistent semantic drift and support the usefulness of the provided representations. The data are openly accessible at the project data portal, promoting broad use in research and applications.

Abstract

In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as an approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving it open for other researchers to select their own target words from the same datasets. DHPLT aims to fill the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.
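
As an illustration of how crawl timestamps can serve as a temporal signal, the following minimal sketch assigns a document to one of the three periods named in the abstract. The function name, the period labels, and the ISO 8601 timestamp format are assumptions made for this example; they are not taken from the DHPLT release, whose actual file layout and metadata fields may differ.

```python
from datetime import datetime
from typing import Optional

# Period boundaries follow the abstract; the tuple layout is a choice
# made for this sketch, not the format used by DHPLT.
PERIODS = [
    ("2011-2015", 2011, 2015),
    ("2020-2021", 2020, 2021),
    ("2024-present", 2024, None),  # open-ended upper bound
]


def period_for_crawl(timestamp: str) -> Optional[str]:
    """Map a crawl timestamp to a period label.

    `timestamp` is assumed to be an ISO 8601 string such as
    "2013-05-17T08:30:00". Returns None when the crawl year falls
    outside all three periods (for example 2016-2019 or 2022-2023).
    """
    year = datetime.fromisoformat(timestamp).year
    for label, start, end in PERIODS:
        if year >= start and (end is None or year <= end):
            return label
    return None


if __name__ == "__main__":
    for ts in ("2013-05-17T08:30:00", "2018-01-01T00:00:00", "2025-02-01T12:00:00"):
        print(ts, "->", period_for_crawl(ts))
```

Documents whose crawl year falls in the gaps between periods are mapped to `None` here, which keeps the three periods clearly separated in time.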

Paper Structure (21 sections, 2 figures, 6 tables)

This paper contains 21 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Number of documents per crawl year in the HPLT v3.0 datasets: English (left) and Georgian (right).
  • Figure 2: Number of target words across 41 languages for all target words (top left), target words that are nouns (top right), verbs (bottom left), and adjectives (bottom right).