JSynFlow: Japanese Synthesised Flowchart Visual Question Answering Dataset built with Large Language Models
Hiroshi Sasaki
TL;DR
JSynFlow addresses the challenge of flowchart QA by synthesising a large-scale Japanese flowchart dataset with LLMs: flowcharts are rendered from DSL (Mermaid) code, and corresponding QA pairs are generated for each one. LoRA-based fine-tuning on the dataset improves the QA performance of VLMs on flowchart tasks, supporting the utility of synthetic, occupation-grounded data. The dataset spans 9 industries, 115 occupations, 1,511 tasks and 11,137 QA pairs, and is publicly available on Hugging Face. The work demonstrates the practicality of automated, domain-specific data synthesis for enhancing specialised VLM capabilities in diagram understanding.
Abstract
Vision and language models (VLMs) are expected to analyse complex documents, such as those containing flowcharts, through a question-answering (QA) interface. The ability to recognise and interpret these flowcharts is in high demand, as they provide valuable insights unavailable in text-only explanations. However, developing VLMs with precise flowchart understanding requires large-scale datasets of flowchart images and corresponding text, the creation of which is highly time-consuming. To address this challenge, we introduce JSynFlow, a synthesised visual QA dataset for Japanese flowcharts, generated using large language models (LLMs). Our dataset comprises task descriptions for various business occupations, the corresponding flowchart images rendered from domain-specific language (DSL) code, and related QA pairs. This paper details the dataset's synthesis procedure and demonstrates that fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. Our dataset is publicly available at https://huggingface.co/datasets/jri-advtechlab/jsynflow.
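To make the dataset's three components concrete (a task description, a flowchart expressed as DSL code, and QA pairs), the following is a minimal Python sketch of what one sample might look like, using Mermaid as the DSL as stated in the TL;DR. The `FlowchartSample` class, its field names, the `to_mermaid` helper, and the example task are all hypothetical illustrations and are not taken from the released dataset's schema or from the synthesis pipeline itself. The example text is in English for readability, whereas JSynFlow's content is in Japanese.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FlowchartSample:
    """Hypothetical record layout; the real JSynFlow schema may differ."""
    occupation: str
    task_description: str
    mermaid_code: str
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)


def to_mermaid(nodes: dict[str, str],
               edges: list[tuple[str, str, str | None]]) -> str:
    """Serialise nodes and (source, target, label) edges as a top-down Mermaid flowchart."""
    lines = ["flowchart TD"]
    # One line per node: identifier followed by its quoted display text.
    for node_id, text in nodes.items():
        lines.append(f'    {node_id}["{text}"]')
    # One line per edge; labelled edges use the -->|label| form.
    for src, dst, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


# Illustrative task (assumed, not from the dataset).
nodes = {
    "A": "Receive order",
    "B": "In stock?",
    "C": "Ship item",
    "D": "Notify customer",
}
edges = [("A", "B", None), ("B", "C", "Yes"), ("B", "D", "No")]

sample = FlowchartSample(
    occupation="Warehouse clerk",
    task_description="Handle an incoming order.",
    mermaid_code=to_mermaid(nodes, edges),
    qa_pairs=[("What happens if the item is in stock?", "Ship item")],
)

print(sample.mermaid_code)
```

The resulting string could then be passed to any Mermaid renderer to obtain the flowchart image that is paired with the questions and answers.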
