VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models

Jeongho Ju, Daeyoung Kim, SunYoung Park, Youngjune Kim

TL;DR

VARCO-VISION-14B addresses the scarcity of open Korean-English vision-language models by introducing a strong bilingual VLM built on SigLIP and Qwen-2.5-14B-Instruct within a LLaVA-OneVision framework. It employs a four-stage training pipeline to preserve backbone knowledge while integrating vision-language capabilities, including OCR, grounding, and referring, and it leverages text-only data to bolster language proficiency. The paper releases five Korean benchmarks (four closed-set MCQA and one open-set generation) and demonstrates competitive or superior performance on Korean and English benchmarks, including OCR and text-only tasks, relative to similar-scale open-source models and approaching proprietary baselines. The work emphasizes open research and practical impact by providing benchmarks and a scalable bilingual VLM with grounded multimodal capabilities, paving the way for broader multilingual AI applications in Korea and beyond.

Abstract

In this paper, we introduce an open-source Korean-English vision-language model (VLM), VARCO-VISION. We incorporate a step-by-step training strategy that allows a model to learn both linguistic and visual information while preserving the backbone model's knowledge. Our model demonstrates outstanding performance in diverse settings requiring bilingual image-text understanding and generation abilities compared to models of similar size. VARCO-VISION is also capable of grounding, referring, and OCR, expanding its usage and potential applications for real-world scenarios. In addition to the model, we release five Korean evaluation datasets: four closed-set benchmarks and one open-set benchmark. We anticipate that our milestone will broaden the opportunities for AI researchers aiming to train VLMs. VARCO-VISION is available at https://huggingface.co/NCSOFT/VARCO-VISION-14B.

Paper Structure

This paper contains 22 sections, 15 figures, 4 tables.

Figures (15)

  • Figure 1: VARCO-VISION Application Examples: Visual Question Answering (VQA), Optical Character Recognition (OCR), Referring, and Grounding. Our model excels at both Korean/English vision-text and text-only tasks. Please see the application examples appendix for more detailed examples.
  • Figure 2: K-MMStar Example
  • Figure 3: K-DTCBench Example
  • Figure 4: K-LLaVA-W Example
  • Figure 5: K-LLaVA-W Evaluation Prompt. We translated the LLaVA-W prompts and added specific guidelines in the JudgeLLM prompt.
  • ...and 10 more figures