Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
Khang T. Doan, Bao G. Huynh, Dung T. Hoang, Thuc D. Pham, Nhat H. Pham, Quan T. M. Nguyen, Bang Q. Vo, Suong N. Hoang
TL;DR
Vintern-1B addresses the shortage of Vietnamese multimodal benchmarks by integrating a vision foundation model (InternViT-300M-448px) with a Vietnamese-capable language model (Qwen2-0.5B-Instruct) and fine-tuning on a large Vietnamese-centric image question-answering corpus. The approach uses dynamic high-resolution tiling, a two-layer MLP projector, and LoRA-based fine-tuning to enable on-device deployment while covering OCR, document understanding, and VQA. A comprehensive suite of Vietnamese VQA datasets, enhanced with Gemini 1.5 Flash prompting, supports general QA, OCR, handwriting recognition, and information extraction tasks. Evaluation with GPT-4o and VLSP benchmarks shows competitive performance, particularly on OCR-related tasks, and the work contributes open Vietnamese VQA datasets to advance local resources and research.
Abstract
In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs and achieves robust performance across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.
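To make the dynamic high-resolution tiling mentioned in the TL;DR more concrete, the following is a minimal sketch of how an image might be mapped to a grid of 448-pixel tiles whose aspect ratio is closest to that of the input. It is an illustration under stated assumptions, not the procedure used in this report: the function names `choose_tile_grid` and `tile_boxes`, the `max_tiles` budget of 6, and the tie-breaking rule are all hypothetical.

```python
from itertools import product

TILE = 448  # tile side in pixels, matching the InternViT-300M-448px input size


def choose_tile_grid(width, height, max_tiles=6):
    """Return (cols, rows) whose aspect ratio is closest to the image's.

    max_tiles is a hypothetical tile budget, not a value from the report.
    """
    image_ratio = width / height
    candidates = [
        (cols, rows)
        for cols, rows in product(range(1, max_tiles + 1), repeat=2)
        if cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with fewer tiles
    # (an assumed tie-break chosen to keep the token count low).
    return min(
        candidates,
        key=lambda g: (abs(image_ratio - g[0] / g[1]), g[0] * g[1]),
    )


def tile_boxes(width, height, max_tiles=6):
    """List crop boxes (left, top, right, bottom) for each tile.

    Assumes the image is first resized to (cols * TILE, rows * TILE).
    """
    cols, rows = choose_tile_grid(width, height, max_tiles)
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]
```

For example, under these assumptions a 896x448 image maps to a 2x1 grid, so the vision encoder would see two 448x448 tiles instead of a single downscaled image, which is what helps preserve small text for OCR-style tasks.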
