JAI-1: A Thai-Centric Large Language Model

Attapol T. Rutherford, Jullajak Karnjanaekarin, Narongkorn Panitsrisit, Pontakorn Trakuekul, Sumana Sumanakul, Natchanon Pollertlam

TL;DR

JAI-1 is a Thai-centric large language model built by upscaling a smaller English-capable backbone (Phi-3-medium) to 75B parameters using tokenizer adaptation, depth-up-scaling, and mixture-of-experts. It uses three-stage pre-training on 1.5T tokens with Thai-dominant data injection and expands the context window to 32K, followed by supervised fine-tuning and alignment tuning with 600K instruction-based examples and DPO-based preferences. The model achieves strong results on Thai benchmarks, outperforming Typhoon2-70B on Thai-specific tasks and approaching or exceeding GPT-3.5-level English and Thai performance on Eng-H6, Thai-H6, and Thai-Exam, while maintaining broad general-language capabilities. The work demonstrates that careful architecture upscaling and structured data curation can improve localized LLMs without eroding underlying general-language knowledge, offering a scalable blueprint for culturally aware NLP in underrepresented regions.

Abstract

This technical report introduces JAI-1, a Thai-centric language model with 75B parameters. Recent Thai models have primarily relied on existing open-source models, applying additional training without structural modifications to specialize in Thai. However, this approach risks eroding pre-existing knowledge in the model's parameter space during the injection of Thai-specific information, as parameters optimized for general tasks may conflict with new linguistic requirements. In contrast, JAI-1 adopts an upscaling strategy: starting from a smaller, high-performing English open-source LLM, we expanded its parameter space and used the newly allocated capacity to systematically integrate Thai-language knowledge. This methodology not only preserves the original model's general intelligence but also establishes a unique architecture distinct from other open-source models, enabling scalable future enhancements. During pre-training, JAI-1 was exposed to 1.5T tokens, including over 300B Thai-language tokens. This was followed by post-training stages -- supervised fine-tuning and alignment tuning -- using more than 600K instruction-based examples. The final model outperformed Typhoon2-70B on Thai-centric benchmarks (IFEval-TH, MT-Bench-TH, and JAI-Hall-Bench), validating the efficacy of its upscaling and knowledge-integration framework.

Paper Structure

This paper contains 38 sections, 1 equation, 2 figures, 13 tables.

Figures (2)

  • Figure 1: Architectural overview of the JAI-1 (75B) model. JAI-1 is a Thai-optimized LLM developed by scaling the Phi-3-medium (14B) architecture. While Phi-3-medium excels in general knowledge benchmarks, JAI-1 increases the parameter count to 75B through depth-up-scaling (DUS) and mixture-of-experts (MoE) techniques. Additionally, the model incorporates a redesigned Thai-optimized tokenizer to improve representation of Thai script and linguistic features. In the figure, GQA and FFN denote grouped-query attention and feed-forward network layers, respectively.
  • Figure 2: Visual illustration of the key differences between two depth-scaling strategies in the DUS framework. The original DUS method, referred to as DUS.v1, extends model depth by selecting K layers from the input side and K layers from the output side, then concatenating these two sets to form a deeper architecture with 2K layers. In contrast, DUS.v2 introduces a block-level approach: the original model is divided into multiple blocks, each containing B consecutive layers, and depth is scaled by duplicating selected blocks in their original sequential order.
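
The two depth-scaling strategies described in the Figure 2 caption can be sketched as layer-index plans, that is, lists naming which original layer each layer of the deeper model is copied from. This is a minimal illustration, not the authors' implementation: the function names, the toy 8-layer model, and the choice to place each duplicated block immediately after its original are assumptions, and the caption does not say which blocks JAI-1 actually duplicates.

```python
from typing import List, Set


def dus_v1_plan(num_layers: int, k: int) -> List[int]:
    """DUS.v1 as described in the caption: take the first K layers and the
    last K layers of the original model and concatenate them (2K layers)."""
    if not 0 < k <= num_layers:
        raise ValueError("k must satisfy 0 < k <= num_layers")
    input_side = list(range(k))                             # layers 0 .. K-1
    output_side = list(range(num_layers - k, num_layers))   # last K layers
    return input_side + output_side


def dus_v2_plan(num_layers: int, block_size: int,
                duplicate_blocks: Set[int]) -> List[int]:
    """DUS.v2 as described in the caption: split the model into blocks of B
    consecutive layers and duplicate the selected blocks, keeping the
    original sequential order.

    Assumption: each duplicate is placed directly after its original block.
    """
    if block_size <= 0 or num_layers % block_size != 0:
        raise ValueError("num_layers must be a positive multiple of block_size")
    plan: List[int] = []
    for b in range(num_layers // block_size):
        block = list(range(b * block_size, (b + 1) * block_size))
        plan.extend(block)
        if b in duplicate_blocks:
            plan.extend(block)  # repeat the block in place
    return plan


# Toy example with a hypothetical 8-layer model, scaled to 12 layers both ways.
v1 = dus_v1_plan(num_layers=8, k=6)
v2 = dus_v2_plan(num_layers=8, block_size=2, duplicate_blocks={1, 2})
print(v1)  # [0, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7]
print(v2)  # [0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7]
```

In this toy setting both plans reach 12 layers from the same middle layers (2 through 5). DUS.v1 repeats them as one long run after the input-side slice, whereas DUS.v2 repeats them block by block next to their originals.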