PyThaiNLP: Thai Natural Language Processing in Python
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, Can Udomcharoenchaikit
TL;DR
PyThaiNLP addresses the lack of open, transparent Thai NLP tooling by delivering a comprehensive open-source Python library with tokenization, tagging, NER, coreference, embeddings, translation, and ASR capabilities. It outlines an ecosystem of features, datasets, and pre-trained models, along with community-driven milestones and industry adoption. The work demonstrates collaborations to scale models (e.g., WangchanBERTa, WangChanGLM) and a suite of Thai-specific datasets (e.g., VISTEC-TPTH-2020, Thai NER, Han-Coref, scb-mt-en-th-2020), underpinned by strong software quality practices. The authors highlight real-world impact across finance, telecom, retail, and services, and chart future directions toward domain specialization, robust benchmarks, deterministic processing, efficient loading, and broader integration with standard NLP ecosystems.
Abstract
We present PyThaiNLP, a free and open-source natural language processing (NLP) library for the Thai language, implemented in Python. It provides a wide range of software, models, and datasets for Thai. We first give a brief historical account of the tools available for Thai before PyThaiNLP was developed. We then outline the functionality the library provides, as well as its datasets and pre-trained language models. Next, we summarize its development milestones and discuss our experience during its development. We conclude by showing how industrial and research communities use PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.
