PhoGPT: Generative Pre-training for Vietnamese
Dat Quoc Nguyen, Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Dinh Phung, Hung Bui
TL;DR
PhoGPT addresses the need for strong open-source Vietnamese LLMs by introducing PhoGPT-4B (3.7B parameters), trained from scratch on a Vietnamese corpus of about 102B tokens with an 8192-token context length, plus PhoGPT-4B-Chat, obtained via supervised fine-tuning on 70K instructional prompts and 290K conversations (about 360K examples in total). The base model is a decoder-only Transformer with a Vietnamese byte-level BPE vocabulary of 20480 token types, pre-trained using MosaicML's LLM-foundry library with flash attention and ALiBi; the chat variant is fine-tuned on the instructional and conversational data to improve alignment. In evaluations on ViTruthfulQA, PhoGPT-4B-Chat performs competitively with closed-source models such as GPT-4-0125 and outperforms many open-source baselines, especially on Vietnam-specific questions. The released models give researchers an accessible starting point for Vietnamese NLP experiments and applications using common libraries.
Abstract
We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with a context length of 8192 tokens and a vocabulary of 20480 token types. The chat variant, PhoGPT-4B-Chat, is obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. We also demonstrate its superior performance compared to previous open-source models. Our PhoGPT models are available at: https://github.com/VinAIResearch/PhoGPT
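
The summary above notes that pre-training uses ALiBi (attention with linear biases), which replaces learned position embeddings with a fixed, per-head linear penalty on attention scores. The following is a minimal pure-Python sketch of that general idea, not the LLM-foundry implementation: the function names `alibi_slopes` and `alibi_bias` are hypothetical, the head count and sequence length are illustrative values rather than PhoGPT's configuration, and the slope formula shown covers only the power-of-two head-count case.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8*h/num_heads) for h = 1..num_heads.

    Assumption: num_heads is a power of two; other head counts need an
    interpolation rule that this sketch does not cover.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch handles power-of-two head counts only")
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(num_heads, seq_len):
    """Additive attention bias indexed as bias[head][query][key].

    For key position j <= query position i the bias is -slope * (i - j),
    so more distant keys are penalised linearly. Entries with j > i are
    left at 0.0 here; a causal mask would exclude them separately.
    """
    slopes = alibi_slopes(num_heads)
    return [
        [
            [-m * (i - j) if j <= i else 0.0 for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in slopes
    ]


if __name__ == "__main__":
    # Illustrative sizes only: 8 heads, 4 positions.
    bias = alibi_bias(num_heads=8, seq_len=4)
    for row in bias[0]:
        print(row)
```

Because the bias depends only on the distance between query and key, it can be computed for any sequence length at inference time, which is one reason such schemes are often paired with long context windows.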
