Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Wenqi Zhang, Zhenglin Cheng, Yuanyu He, Mengna Wang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Mingqian He, Yanna Ma, Weiming Lu, Yueting Zhuang

TL;DR

The paper tackles the challenge that existing large multimodal models underperform on abstract image understanding and visual reasoning tasks such as charts, maps, and dashboards. It introduces a code-centric, LLM-driven self-instruct pipeline that autonomously proposes visual ideas, generates simulated data and plotting code, and crafts reasoning questions with rationales, enabling the rapid creation of a large synthetic benchmark (11,193 instructions across eight scenarios) and focused fine-tuning data (62,476 chart/table/map instructions). The authors demonstrate that current LMMs exhibit substantial gaps on the benchmark, and show that fine-tuning Llava-1.5-7B with synthetic data significantly improves chart interpretation and map navigation, with positive cross-task synergies and some generalization to untrained tasks. These results suggest that synthetic, programmatic image generation paired with instruction tuning can meaningfully advance abstract visual reasoning in multimodal systems, with practical implications for AI assistants in data analysis and navigation tasks.

Abstract

Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or layouts, and their visual reasoning capabilities remain quite rudimentary. They often struggle with simple daily tasks, such as reading time from a clock, understanding a flowchart, or planning a route using a road map. In light of this, we design a multi-modal self-instruct strategy, utilizing large language models and their code capabilities to synthesize massive abstract images and visual reasoning instructions across daily scenarios. Our strategy effortlessly creates a multimodal benchmark with 11,193 instructions for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. This benchmark, constructed with simple lines and geometric elements, exposes the shortcomings of most advanced LMMs like Claude-3.5-Sonnet and GPT-4o in abstract image understanding, spatial relations reasoning, and visual element induction. Besides, to verify the quality of our synthetic data, we fine-tune an LMM using 62,476 synthetic chart, table and road map instructions. The results demonstrate improved chart understanding and map navigation performance, and also demonstrate potential benefits for other visual reasoning tasks. Our code is available at: https://github.com/zwq2018/Multi-modal-Self-instruct.
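
The code-centric idea in the abstract (propose a visual idea, simulate data, write plotting code, then derive questions with rationales) can be illustrated with a minimal Python sketch. This is not the authors' implementation: in the paper each step is carried out by an LLM, whereas here every step is a hand-written stub, and all names (`SyntheticSample`, `propose_idea`, `simulate_data`, `write_plot_code`, `design_questions`, `build_sample`) are hypothetical. The plotting code is emitted as text rather than executed, so the sketch runs without matplotlib installed.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SyntheticSample:
    idea: str        # textual description of the abstract image
    data: dict       # simulated values behind the image
    plot_code: str   # source code that would render the image
    qa_pairs: list = field(default_factory=list)


def propose_idea() -> str:
    # Stub for the LLM step that proposes a visual idea (assumption).
    return "Bar chart of units sold per product"


def simulate_data(rng: random.Random) -> dict:
    # Stub for the LLM step that invents plausible data for the idea.
    products = ["A", "B", "C", "D"]
    # Sampling without replacement keeps values distinct, so "the most" is unambiguous.
    values = rng.sample(range(10, 101), k=len(products))
    return dict(zip(products, values))


def write_plot_code(idea: str, data: dict) -> str:
    # Emit matplotlib source as text; rendering is left to the caller.
    return (
        "import matplotlib.pyplot as plt\n"
        f"labels = {list(data.keys())!r}\n"
        f"values = {list(data.values())!r}\n"
        "plt.bar(labels, values)\n"
        f"plt.title({idea!r})\n"
        "plt.savefig('chart.png')\n"
    )


def design_questions(data: dict) -> list:
    # Answers are computed from the same data the plot uses,
    # so they agree with the rendered image by construction.
    best = max(data, key=data.get)
    total = sum(data.values())
    terms = " + ".join(str(v) for v in data.values())
    return [
        {
            "question": "Which product sold the most units?",
            "answer": best,
            "rationale": f"The tallest bar is {best} with {data[best]} units.",
        },
        {
            "question": "How many units were sold in total?",
            "answer": str(total),
            "rationale": f"Sum of bar heights: {terms} = {total}.",
        },
    ]


def build_sample(seed: int) -> SyntheticSample:
    # Chain the four steps; a fixed seed makes the sample reproducible.
    rng = random.Random(seed)
    idea = propose_idea()
    data = simulate_data(rng)
    return SyntheticSample(idea, data, write_plot_code(idea, data), design_questions(data))
```

Calling `build_sample(0)` returns one image specification together with its question-answer pairs. The point of the sketch is that because both the image and the answers come from the same data, no manual annotation is needed, which is what makes this style of synthesis cheap to scale.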

Paper Structure (40 sections, 14 figures, 7 tables)

This paper contains 40 sections, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Benchmarking Leading LMMs on abstract image understanding and reasoning tasks.
  • Figure 2: We leverage LLM and code to synthesize abstract images and self-instruct diverse reasoning instructions, e.g., charts, road maps, dashboards, visual puzzles, and relation graphs. Unlike natural landscapes and human photos, these non-natural images constructed with geometric elements require stronger perception and spatial relation reasoning. Our benchmark indicates that current LMMs are far from human-level performance. They even fail to complete simple daily tasks, e.g., reading the time on a clock or planning a route using a map.
  • Figure 3: Our multi-modal self-instruct strategy first self-proposes a visual idea to depict an abstract image. Based on this, the LLM generates simulated data and writes code to create the drawings. Subsequently, the LLM is instructed to design multiple Q&A pairs based on the code and idea, covering various aspects such as spatial reasoning, color recognition, and mathematical reasoning, constructing a rich set of multimodal instructions.
  • Figure 4: Left: The distribution of different chart types. Right: The number of questions for each category.
  • Figure 5: Top: We present three examples of road maps with different path complexity. Bottom: We categorize all maps into five levels of complexity.
  • ...and 9 more figures