PhoGPT: Generative Pre-training for Vietnamese

Dat Quoc Nguyen; Linh The Nguyen; Chi Tran; Dung Ngoc Nguyen; Dinh Phung; Hung Bui

TL;DR

PhoGPT addresses the need for strong open-source Vietnamese LLMs by introducing PhoGPT-4B (3.7B parameters), trained from scratch on a Vietnamese corpus of about 102B tokens with an 8192-token context, plus PhoGPT-4B-Chat, obtained via supervised fine-tuning on about 360K examples (70K instructional prompts with responses and 290K conversations). The base model is a decoder-only Transformer with a Vietnamese byte-level BPE vocabulary of 20480 tokens, pre-trained using the MosaicML LLM-foundry with flash attention and ALiBi; the chat variant is tuned on a diverse instructional/conversational dataset to improve alignment. In evaluations on ViTruthfulQA, PhoGPT-4B-Chat performs competitively with closed-source models such as GPT-4-0125 and outperforms many open-source baselines, especially on Vietnam-specific questions. The work provides an accessible, research-friendly starting point for Vietnamese NLP development that can be used with common libraries.
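The summary above names ALiBi as the positional scheme used during pre-training. As a rough illustration of what that technique does, the sketch below computes the per-head slopes and the causal bias that ALiBi adds to attention scores in place of positional embeddings. It is a minimal pure-Python sketch of the general method, not the PhoGPT or LLM-foundry implementation; the function names `alibi_slopes` and `alibi_bias` are hypothetical, the head count of 8 is chosen only for illustration, and the sketch assumes a power-of-two number of heads.

```python
from typing import List


def alibi_slopes(n_heads: int) -> List[float]:
    """Per-head ALiBi slopes for a power-of-two head count.

    The slopes form a geometric sequence whose first term and ratio are
    both 2 ** (-8 / n_heads). (Assumption: other head counts are out of
    scope for this sketch.)
    """
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Causal ALiBi bias for one head.

    Entry [i][j] is slope * (j - i) for keys j <= i, so the penalty grows
    linearly with the distance between query i and key j. Future keys
    (j > i) get -inf, which masks them out after the softmax.
    """
    neg_inf = float("-inf")
    return [
        [slope * (j - i) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Illustrative values only: 8 heads and a 4-token sequence.
slopes = alibi_slopes(8)
bias_head0 = alibi_bias(slopes[0], 4)
```

The bias matrix is added to the query-key scores before the softmax. Because the penalty depends only on relative distance, the same rule applies at any sequence length, which is the usual motivation for pairing ALiBi with long context windows.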

Abstract

We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with a context length of 8192 and a vocabulary of 20480 token types. The chat variant, PhoGPT-4B-Chat, is obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. We also demonstrate its superior performance compared to previous open-source models. Our PhoGPT models are available at: https://github.com/VinAIResearch/PhoGPT

Paper Structure (6 sections, 1 table)

This paper contains 6 sections, 1 table.

Introduction
PhoGPT
PhoGPT-4B: Model architecture and Pre-training
PhoGPT-4B-Chat: Supervised fine-tuning
Evaluation
Conclusion
