Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models

Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai

TL;DR

Typhoon 2 is a family of Thai-optimized LLMs spanning text, vision, and audio, built on open backbones (Llama 3, Qwen2.5) with bilingual continual pre-training (CPT), extensive post-training, and Thai-specific safety. The work highlights data-centric CPT with diverse Thai sources, long-context adaptations, function calling, distillation, and model merging to boost instruction-following and multilingual performance. The multimodal components, Typhoon2-Vision and Typhoon2-Audio, demonstrate improved Thai OCR, document understanding, and end-to-end speech processing, supported by agentic data curation and Thai-centric evaluation. The public release of weights and safety tools aims to accelerate Thai-language AI development, and the work identifies best practices for cross-lingual transfer, long-context utilization, and safety in low-resource languages.

Abstract

This paper introduces Typhoon 2, a series of text and multimodal large language models optimized for the Thai language. The series includes models for text, vision, and audio. Typhoon2-Text builds on state-of-the-art open models, such as Llama 3 and Qwen2, and we perform continual pre-training on a mixture of English and Thai data. We employ post-training techniques to enhance Thai language performance while preserving the base models' original capabilities. We release text models across a range of sizes, from 1 to 70 billion parameters, available in both base and instruction-tuned variants. To guardrail text generation, we release Typhoon2-Safety, a classifier enhanced for Thai cultures and language. Typhoon2-Vision improves Thai document understanding while retaining general visual capabilities, such as image captioning. Typhoon2-Audio introduces an end-to-end speech-to-speech model architecture capable of processing audio, speech, and text inputs and generating both text and speech outputs.

Paper Structure

This paper contains 71 sections, 3 equations, 10 figures, 39 tables.

Figures (10)

  • Figure 1: Thai Pretraining Data Mixture
  • Figure 2: Evaluation of Typhoon2-Llama3.1-8B-Instruct on Needle-in-a-Haystack for both English (Left) and Thai (Right).
  • Figure 3: Evaluation of Typhoon2-Llama3.1-70B-Instruct on Needle-in-a-Haystack for both English (Left) and Thai (Right).
  • Figure 4: Evaluation of Typhoon2-Qwen2.5-7B-Instruct on Needle-in-a-Haystack for both English (Left) and Thai (Right).
  • Figure 5: Pipeline of Thai topic data generation
  • ...and 5 more figures