Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models

Issa Sugiura, Keito Sasagawa, Keisuke Nakao, Koki Maeda, Ziqi Yin, Zhishen Yang, Shuhei Kurita, Yusuke Oda, Ryoko Tokuhisa, Daisuke Kawahara, Naoaki Okazaki

Abstract

Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and coming within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; on the contrary, it improves English performance over training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.
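The VLM-based QA generation strategy mentioned above can be illustrated with a short sketch: an image is sent to an instruction-tuned VLM through an OpenAI-compatible chat API, together with a prompt asking for a Japanese question-answer pair. The endpoint, model identifier, and prompt wording below are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of VLM-based QA generation (illustrative assumptions;
# not the paper's actual pipeline). Requires an OpenAI-compatible server
# hosting a vision-language model.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

def generate_qa_pair(image_path: str, model_name: str = "vlm-qa-generator") -> str:
    """Ask the hosted VLM for one Japanese question-answer pair about the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    response = client.chat.completions.create(
        model=model_name,  # hypothetical model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                # "Create one question and its answer about this image, in Japanese."
                {"type": "text",
                 "text": "この画像について、質問とその答えを1組、日本語で作成してください。"},
            ],
        }],
    )
    return response.choices[0].message.content

print(generate_qa_pair("sample.jpg"))
```

In practice, the model's free-form output would still need to be parsed into separate question and answer fields and filtered for quality before joining the dataset.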

Paper Structure

This paper contains 19 sections, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Construction pipeline of Jagle. Our pipeline leverages diverse data sources, including images, image-text pairs, and PDF corpora, and integrates multiple QA generation strategies such as VLM-based QA generation, translation, OCR-based text extraction, text rendering, and direct utilization of existing data to produce VQA samples.
  • Figure 2: Category distribution of Jagle across four metrics: number of samples, unique images, turns, and answer tokens.
  • Figure 3: t-SNE visualization of SigLIP2 image embeddings for 5,000 images randomly sampled from Jagle (see the sketch after this list). Chart & Table and Native OCR images form distinct clusters, while General VQA, Captioning, and OCR QA images are largely intermingled.
  • Figure 4: Representative VQA examples from each category in Jagle. The dataset covers a wide variety of visual content, including natural images, charts and tables, document images, and presentation slides.
  • Figure 5: Training dynamics under each data setting for the macro-averaged score over all 21 tasks (Avg), 10 Japanese tasks (JA Avg), and 10 English tasks (EN Avg). The model trained on Jagle outperforms the model trained on FineVision by over 20 points on JA Avg.
  • ...and 2 more figures
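As context for the Figure 3 visualization, the sketch below shows one plausible way to produce it: encode sampled images with a SigLIP2 checkpoint via Hugging Face transformers and project the embeddings to 2D with scikit-learn's t-SNE. The checkpoint name, normalization, and t-SNE parameters are assumptions; the paper's exact configuration is not stated here.

```python
# Minimal sketch of a Figure 3-style plot: SigLIP2 image embeddings
# projected to 2D with t-SNE. The checkpoint name is an assumption.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip2-base-patch16-224"  # assumed SigLIP2 checkpoint
model = AutoModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

def embed_images(paths: list[str]) -> np.ndarray:
    """Return one L2-normalized SigLIP2 embedding per image."""
    feats = []
    for path in paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            f = model.get_image_features(**inputs)  # shape: (1, hidden_dim)
        feats.append(torch.nn.functional.normalize(f, dim=-1).squeeze(0).numpy())
    return np.stack(feats)

paths = sorted(str(p) for p in Path("images").glob("*.jpg"))  # e.g., 5,000 sampled images
embeddings = embed_images(paths)
# t-SNE perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
```

The resulting 2D coordinates can then be scattered and colored by dataset category to reveal cluster structure like that described for Figure 3.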