Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models
Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai
TL;DR
Typhoon 2 presents a comprehensive family of Thai-optimized LLMs spanning text, vision, and audio, built on open backbones (Llama 3, Qwen2.5) with bilingual pre-training, extensive post-training, and Thai-specific safety. The work highlights data-centric continual pre-training (CPT) with diverse Thai sources, long-context adaptations, function calling, distillation, and model merging to boost instruction-following and multilingual performance. The multimodal components, Typhoon2-Vision and Typhoon2-Audio, demonstrate improved Thai OCR, document understanding, and end-to-end speech processing, with careful agentic data curation and Thai-centric evaluation. The public release of weights and safety tools aims to accelerate Thai-language AI development, while uncovering best practices for cross-lingual transfer, long-context utilization, and safety in low-resource languages.
Abstract
This paper introduces Typhoon 2, a series of text and multimodal large language models optimized for the Thai language. The series includes models for text, vision, and audio. Typhoon2-Text builds on state-of-the-art open models, such as Llama 3 and Qwen2, on which we perform continual pre-training using a mixture of English and Thai data. We employ post-training techniques to enhance Thai language performance while preserving the base models' original capabilities. We release text models across a range of sizes, from 1 to 70 billion parameters, available in both base and instruction-tuned variants. To guardrail text generation, we release Typhoon2-Safety, a classifier enhanced for Thai cultures and language. Typhoon2-Vision improves Thai document understanding while retaining general visual capabilities, such as image captioning. Typhoon2-Audio introduces an end-to-end speech-to-speech model architecture capable of processing audio, speech, and text inputs and generating both text and speech outputs.
