A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

Gokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas, Özen Nergis Dolcerocca

TL;DR

To address the gap in NLP resources for historical, low-resource languages, the paper introduces an open-access dataset of over 3000 19th-century Ottoman Turkish and Russian texts, organized by a four-level taxonomy for multi-level, multi-label classification. It describes data collection and expert labeling, and reports baseline benchmarks using a BoW naive Bayes model and open LLMs (Llama-2-7b, Falcon-7b, mBERT), finding that BoW can rival or exceed the LLMs in low-resource settings. The results highlight the challenges of long documents and ultra-low-resource Ottoman data, as well as the impact of modeling choices such as 4-bit quantization and frozen backbones. The work aims to democratize NLP resources for historical languages and releases the dataset openly to spur further research.

Abstract

This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts in 19th-century Ottoman Turkish and Russian, sourced from prominent literary periodicals of the era, and this is the first study to apply large language models (LLMs) to it. The texts have been organized and labeled according to a taxonomic framework that takes into account both their structural and semantic attributes; human experts categorized the articles and tagged them with bibliometric metadata. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. In certain cases, BoW outperforms the LLMs, which underscores the need for additional research, especially in low-resource language settings. We expect this dataset to be a valuable resource for researchers in natural language processing and machine learning, especially those working on historical and low-resource languages. The dataset is publicly available^1.
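
To make the BoW naive Bayes baseline concrete, the following is a minimal sketch of multi-label classification with one binary multinomial naive Bayes model per label (a one-vs-rest setup). It is not the authors' implementation: the one-vs-rest decomposition, the whitespace tokenizer, the Laplace smoothing, the toy English documents, and the label names `poetry` and `criticism` are all assumptions made for illustration, and the sketch covers a single taxonomy level only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization; a stand-in for real preprocessing."""
    return text.lower().split()


class BinaryNaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts for one yes/no label."""

    def fit(self, docs, targets):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for doc, target in zip(docs, targets):
            self.counts[target].update(tokenize(doc))
            self.n_docs[target] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, tokens, cls):
        total_docs = self.n_docs[True] + self.n_docs[False]
        # Laplace-smoothed log prior for the class.
        score = math.log((self.n_docs[cls] + 1) / (total_docs + 2))
        total_words = sum(self.counts[cls].values())
        for tok in tokens:
            if tok in self.vocab:  # skip words never seen during training
                smoothed = (self.counts[cls][tok] + 1) / (total_words + len(self.vocab))
                score += math.log(smoothed)
        return score

    def predict(self, doc):
        tokens = tokenize(doc)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


def train_multilabel(docs, label_sets):
    """Fit one binary model per label (one-vs-rest)."""
    labels = sorted(set().union(*label_sets))
    return {
        label: BinaryNaiveBayes().fit(docs, [label in s for s in label_sets])
        for label in labels
    }


def predict_labels(models, doc):
    """Return every label whose binary model fires for the document."""
    return {label for label, model in models.items() if model.predict(doc)}


# Hypothetical toy corpus; the real dataset uses Ottoman Turkish and Russian texts.
train_docs = [
    "moon rose nightingale verse",
    "verse moon garden",
    "review critique novel",
    "critique review verse moon",
]
train_labels = [{"poetry"}, {"poetry"}, {"criticism"}, {"poetry", "criticism"}]

models = train_multilabel(train_docs, train_labels)
print(predict_labels(models, "moon verse"))
print(predict_labels(models, "review critique novel"))
```

Because each label gets its own independent classifier, a document can receive zero, one, or several labels, which is what distinguishes multi-label from ordinary multi-class classification.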

Paper Structure (17 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: A training instance from the Ottoman (a) and Russian (b) collections, sampled from the third level of the taxonomy (i.e., Cultural Discourse $\rightarrow$ Modernization Subject). The articles are truncated for display.
  • Figure 2: Number of samples in each category of the Ottoman dataset
  • Figure 3: Number of samples in each category of the Russian dataset