Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

Khang T. Doan, Bao G. Huynh, Dung T. Hoang, Thuc D. Pham, Nhat H. Pham, Quan T. M. Nguyen, Bang Q. Vo, Suong N. Hoang

TL;DR

Vintern-1B addresses the shortage of Vietnamese multimodal resources by integrating a vision foundation model (InternViT-300M-448px) with a Vietnamese-capable language model (Qwen2-0.5B-Instruct) and fine-tuning on a large Vietnamese-centric image question-answering corpus. The approach uses dynamic high-resolution tiling, a two-layer MLP projector, and LoRA-based fine-tuning, keeping the model small enough for on-device deployment while covering OCR, document understanding, and VQA. A suite of Vietnamese VQA datasets, enhanced with Gemini 1.5 Flash prompting, supports general QA, OCR, handwriting, and information extraction tasks. Evaluation with GPT-4o and on VLSP benchmarks shows competitive performance, particularly on OCR-related tasks, and the work contributes open Vietnamese VQA datasets to advance local resources and research.

Abstract

In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.

Paper Structure (18 sections, 1 figure, 1 table)

Figures (1)

  • Figure 1: Overall Architecture. Vintern-1B is built upon the ViT-MLP-LLM framework, following the structure of well-known MLLMs (chen2024internvl, liu2023llava, liu2023improvedllava, liu2024improvedllava1.6). It inherits from InternVL 1.5 (chen2024internvl), integrating a pre-trained InternViT-300M-448px (chen2024internvl) with Qwen2-0.5B-Instruct (yang2024qwen2) via an MLP projector. The input image is processed by the Dynamic High Resolution module, which splits it into smaller 448×448-pixel tiles along with a thumbnail. These images are then passed through InternViT-300M-448px to extract visual features. A Pixel Shuffle step is applied before the features are fed into the MLP projector, which aligns them with the embeddings of the large language model Qwen2-0.5B-Instruct. The language model takes the aligned visual tokens and the related question as inputs and generates the corresponding answer.
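
The tiling and token-reduction steps described in Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation. The 448-pixel tile size and the thumbnail come from the caption; the 12-tile limit, the 14-pixel patch size, and the 2×2 pixel-shuffle factor are assumptions borrowed from the InternVL 1.5 design and are not stated in this summary. The helper names (`choose_grid`, `tile_boxes`, `visual_token_count`) are hypothetical, and the grid selection is simplified to "closest aspect ratio, fewest tiles on a tie".

```python
from itertools import product

TILE = 448  # tile side in pixels, as given in the Figure 1 caption


def candidate_grids(max_tiles=12):
    """All (cols, rows) grids with at most max_tiles tiles (limit is an assumption)."""
    grids = {
        (c, r)
        for c, r in product(range(1, max_tiles + 1), repeat=2)
        if c * r <= max_tiles
    }
    # Sort by tile count first so that ties below resolve to the smallest grid.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def choose_grid(width, height, max_tiles=12):
    """Pick the grid whose aspect ratio is closest to the image's aspect ratio."""
    target = width / height
    return min(candidate_grids(max_tiles), key=lambda g: abs(target - g[0] / g[1]))


def tile_boxes(cols, rows):
    """Crop boxes (left, top, right, bottom) in an image resized to cols*448 x rows*448."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


def visual_token_count(cols, rows, patch=14, shuffle=2):
    """Visual tokens passed to the projector after pixel shuffle.

    patch=14 and shuffle=2 are assumed values, not taken from this summary.
    Pixel shuffle merges each shuffle x shuffle block of patch tokens into one
    token with more channels, so the count per tile drops by shuffle**2.
    """
    per_side = TILE // patch                  # patch tokens along one tile side
    per_tile = (per_side // shuffle) ** 2     # tokens per tile after pixel shuffle
    n_images = cols * rows
    if n_images > 1:
        n_images += 1                         # thumbnail of the whole image
    return n_images * per_tile


if __name__ == "__main__":
    cols, rows = choose_grid(896, 448)
    print("grid:", (cols, rows))
    print("boxes:", tile_boxes(cols, rows))
    print("visual tokens:", visual_token_count(cols, rows))
```

Under these assumptions, a 2:1 landscape image maps to a 2×1 grid plus a thumbnail, so three 448×448 images reach the vision encoder. Reducing the token count before the projector keeps the sequence that the 0.5B-parameter language model must attend over short, which matters for the on-device target mentioned in the abstract.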