Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, Hongsheng Li

TL;DR

This work addresses the critical challenge of factual fidelity in structured visuals, proposing a holistic solution that combines a large-scale code-aligned dataset with chain-of-thought reasoning, a unified multimodal model built on FLUX.1 Kontext augmented by a lightweight Qwen-VL connector, and a dedicated benchmark StructBench with StructScore to rigorously evaluate fine-grained factual accuracy. The approach employs a three-stage training curriculum and an external reasoner at inference to boost planning and reasoning, achieving notable improvements in both generation and editing of structured visuals. By releasing the dataset, model, and benchmark, the work advances open-source capabilities for unified multimodal foundations that can handle precise, domain-specific visuals essential for scientific and technical domains.

Abstract

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.

Paper Structure

This paper contains 27 sections, 3 equations, 21 figures, 9 tables.

Figures (21)

  • Figure 1: Overview of our work. Left: We showcase the diverse text-to-image (T2I) and editing examples from our dataset. In contrast to natural images, modeling structured visuals demands sophisticated composition planning, strong multimodal understanding, and precise text rendering, as highlighted by the three key characteristics. Right: Our model demonstrates competitive performance against leading closed-source systems in both structured image generation and editing benchmarks.
  • Figure 2: Data construction pipeline. We prompt GPT-5 to extract salient features, then generate paired editing instructions from the source code and rendered image. The source code is modified according to the code-editing instructions. The target image rendered from modified code is passed through rule-based filters to ensure the overall quality of the constructed dataset.
  • Figure 3: Benchmark construction and evaluation workflow. (a) Benchmark construction: We cluster the data into six categories, and for each editing and text-to-image (T2I) example, GPT-5 generates detailed image descriptions that are transformed into Q&A pairs for evaluating diverse visual aspects. (b) Evaluation protocol: Using the Q&A pairs, GPT-5 is queried on generated images for open-ended responses, which are compared with ground-truth answers to yield a final score.
  • Figure 4: Comparison of the initial and revised atomic Q&A pairs. Initial Q&A pairs sometimes entangle multiple attributes, hindering unambiguous verification and accurate scoring. Enforcing atomicity, i.e., one attribute or relation per Q&A, substantially improves metric reliability.
  • Figure 5: Statistical analysis of our dataset (a-c) and benchmark (d-f).
  • ...and 16 more figures
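
The evaluation protocol described in the Figure 3(b) caption (query a model on the generated image with atomic Q&A pairs, then compare its answers with the ground truth) can be illustrated with a minimal sketch. This is not the paper's StructScore implementation: the names `QAPair`, `normalize`, and `qa_accuracy` are hypothetical, the `answer_fn` callable stands in for a vision-language model queried on an image, and the normalized exact-match comparison is an assumption made only to keep the example self-contained. The final score here is simply the fraction of questions answered correctly.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    """One atomic question with its ground-truth answer (hypothetical type)."""
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())


def qa_accuracy(pairs: List[QAPair], answer_fn: Callable[[str], str]) -> float:
    """Fraction of questions whose predicted answer matches the ground truth.

    `answer_fn` is a stand-in for a model queried on the generated image.
    Exact match after normalization is an assumption of this sketch; the
    source only says that responses are compared with ground-truth answers.
    """
    if not pairs:
        raise ValueError("at least one Q&A pair is required")
    correct = sum(
        1 for p in pairs if normalize(answer_fn(p.question)) == normalize(p.answer)
    )
    return correct / len(pairs)


# Usage with canned answers in place of a real model (illustrative data only).
example_pairs = [
    QAPair("What color is the bar for 2020?", "blue"),
    QAPair("How many bars are shown?", "3"),
    QAPair("What is the x-axis label?", "Year"),
]
canned_answers = {
    "What color is the bar for 2020?": "Blue.",
    "How many bars are shown?": "4",
    "What is the x-axis label?": " year ",
}
example_score = qa_accuracy(example_pairs, lambda q: canned_answers[q])
print(f"{example_score:.2f}")  # prints 0.67
```

Keeping each pair to a single attribute, as the Figure 4 caption recommends, is what makes a per-question correct/incorrect decision like this one unambiguous.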