LaVy: Vietnamese Multimodal Large Language Model

Chi Tran, Huong Le Thanh

TL;DR

This work addresses the lack of high-quality Vietnamese multimodal resources by introducing LaVy, a Vietnamese Multimodal Large Language Model, and LaVy-Bench, a benchmark for Vietnamese visual-language understanding. LaVy adopts a CLIP-Large vision encoder, a two-layer MLP projector, and a Vistral-7B LLM, trained in a two-stage process with extensive data curation and LoRA-based finetuning. The authors demonstrate state-of-the-art performance on Vietnamese V-L tasks, including zero-shot VQA and in-the-wild evaluations, outperforming multilingual baselines. They also provide a standardized evaluation framework to accelerate future research in Vietnamese multimodal AI, while acknowledging data limitations and hallucination as ongoing challenges and outlining future work toward broader tasks such as OCR and counting.
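The architecture described above (vision encoder, two-layer MLP projector, LLM) can be illustrated with a small sketch of the projector step. This is not the authors' implementation: the GELU activation, the hidden width, the initialization, the toy dimensions, and the choice to place image tokens before the text embeddings are all assumptions made for illustration, and the names `MLPProjector` and `build_llm_input` are hypothetical. In a real model, the input width would match the vision encoder's feature size and the output width would match the LLM's embedding size.

```python
import math
import random


def gelu(x):
    # Exact GELU; the activation choice is an assumption, not stated in the source.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    # weight has out_dim rows, each of length in_dim.
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    # Small uniform initialization, chosen only to keep the toy example stable.
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


class MLPProjector:
    """Hypothetical two-layer MLP mapping vision features to the LLM embedding width."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, llm_dim, rng)
        self.w2, self.b2 = init_linear(llm_dim, llm_dim, rng)

    def __call__(self, patch_features):
        projected = []
        for patch in patch_features:
            hidden = [gelu(h) for h in linear(patch, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


def build_llm_input(image_tokens, text_embeddings):
    # Assumption: image tokens are placed before the text embeddings.
    return image_tokens + text_embeddings


# Toy dimensions for illustration only; real models use much larger widths.
VISION_DIM, LLM_DIM = 8, 16
rng = random.Random(1)
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(4)]
text = [[0.0] * LLM_DIM for _ in range(3)]

projector = MLPProjector(VISION_DIM, LLM_DIM)
image_tokens = projector(patches)
sequence = build_llm_input(image_tokens, text)
print(len(sequence), len(sequence[0]))  # 7 16
```

Each image patch becomes one token in the LLM's embedding space, so the LLM receives a single sequence mixing visual and textual positions.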

Abstract

Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have taken the world by storm with impressive abilities in complex reasoning and linguistic comprehension. While there is a plethora of work on Vietnamese Large Language Models, the lack of high-quality multimodal resources limits the progress of Vietnamese MLLMs. In this paper, we pioneer an effort to address this gap by introducing LaVy, a state-of-the-art Vietnamese MLLM, along with LaVy-Bench, a benchmark designed for evaluating MLLMs' understanding of Vietnamese visual-language tasks. Our project is public at https://github.com/baochi0212/LaVy

Paper Structure (15 sections, 1 figure, 3 tables)