Recent Advancements and Challenges of Turkic Central Asian Language Processing

Yana Veitsman; Mareike Hartmann

TL;DR

Central Asian Turkic languages face pronounced data scarcity and uneven NLP resource availability. The paper surveys linguistic features, data ecosystems, transfer learning opportunities, data augmentation, and current technology across Kazakh, Uzbek, Kyrgyz, and Turkmen, and proposes concrete directions for future work. Kazakh emerges as the most resource-rich of the four, with Uzbek advancing rapidly, while Kyrgyz and Turkmen remain data-poor; multilingual and cross-lingual transfer approaches offer a viable path to improving coverage for the lower-resource languages. By highlighting the role of multilingual corpora, transliteration strategies, and data-augmentation methods, the work offers a practical roadmap for researchers and policymakers who want to move these languages toward higher-resource status.

Abstract

Research in NLP for the Central Asian Turkic languages (Kazakh, Uzbek, Kyrgyz, and Turkmen) faces typical low-resource challenges such as data scarcity, limited linguistic resources, and limited technology development. Recent advancements, however, include the collection of language-specific datasets and the development of models for downstream tasks. This paper summarizes that progress and identifies future research directions. It provides a high-level overview of each language's linguistic features, the current technology landscape, the application of transfer learning from higher-resource languages, and the availability of labeled and unlabeled data. By outlining the current state, we hope to inspire and facilitate future research.

Paper Structure (33 sections, 1 figure, 4 tables)

This paper contains 33 sections, 1 figure, 4 tables.

Introduction
Related Work
Difficulties in Processing Turkic Languages
Overview
Similarities and Differences
Datasets Availability
Sources of Data and Stakeholders
Kazakh Language Datasets
Uzbek Language Datasets
Kyrgyz Language Datasets
Turkmen Language Datasets
Web-Scraped Datasets
Multilingual Datasets
Parallel Corpora
Classifying Languages by Data Availability
...and 18 more sections

Figures (1)

Figure 1: Distribution of Kazakh, Uzbek, Kyrgyz, and Turkmen native speakers among all Turkic language speakers. Numbers in the legend are approximate. Source: https://en.wikipedia.org/w/index.php?title=Languages_of_Asia&oldid=1230214231
