TurkicNLP: An NLP Toolkit for Turkic Languages

Sherzod Hakimov

TurkicNLP: An NLP Toolkit for Turkic Languages

Sherzod Hakimov

TL;DR

TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic, is presented.

Abstract

Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .

TurkicNLP: An NLP Toolkit for Turkic Languages

TL;DR

Abstract

Paper Structure (29 sections, 2 figures, 4 tables)

This paper contains 29 sections, 2 figures, 4 tables.

Introduction
Background
Motivation
Contributions
Related Work
General-Purpose NLP Toolkits
Turkic-Specific NLP Tools and Pipelines
Multilingual Models and Benchmarks
Resource Collections and Community Initiatives
Turkic NLP Resource Landscape
Library Design and Architecture
Design Goals
Pipeline Architecture
Document Data Model
Processor Interface
...and 14 more sections

Figures (2)

Figure 1: Geographic distribution of the Turkic language family (Source: Wikimedia Commons).
Figure 2: High-level component architecture of TurkicNLP. The Pipeline orchestrates a sequential Processor chain; the Script subsystem handles automatic script detection and transliteration; Model Management resolves and downloads model artifacts via catalog.json; the Document Model provides a hierarchical output representation.

TurkicNLP: An NLP Toolkit for Turkic Languages

TL;DR

Abstract

TurkicNLP: An NLP Toolkit for Turkic Languages

Authors

TL;DR

Abstract

Table of Contents

Figures (2)