OpenJAI-v1.0: An Open Thai Large Language Model

Pontakorn Trakuekul, Attapol T. Rutherford, Jullajak Karnjanaekarin, Narongkorn Panitsrisit, Sumana Sumanakul

TL;DR

OpenJAI-v1.0 addresses the gap in Thai-language AI capabilities by fine-tuning a strong base model on carefully curated Thai–English data, emphasizing instruction following, long-context understanding, and tool use. The authors apply strict data curation, LLM-based quality control, and a streamlined 462M-token training regime on an 8×H100 cluster, and evaluate on a diverse set of benchmarks to check generalization without catastrophic forgetting. The results show OpenJAI-v1.0 achieving strong instruction following, competitive multi-turn dialogue, robust long-context reasoning, and excellent tool-calling ability, often outperforming other open-source Thai models and approaching or matching proprietary baselines on several tasks. This work contributes a publicly available, practically capable Thai–English foundation resource that strengthens the Thai NLP ecosystem and demonstrates the value of targeted post-training for real-world utility.

Abstract

We introduce OpenJAI-v1.0, an open-source large language model for Thai and English, developed from the Qwen3-14B model. Our work focuses on boosting performance on practical tasks through carefully curated data across three key use cases: instruction following, long-context understanding, and tool use. Evaluation results show that OpenJAI-v1.0 improves on the capabilities of its base model and outperforms other leading open-source Thai models on a diverse suite of benchmarks, while avoiding catastrophic forgetting. OpenJAI-v1.0 is publicly released as an alternative NLP resource for the Thai AI community.

Paper Structure

This paper contains 10 sections, 1 table.