SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, Wentao Zhang

TL;DR

SciFlow-Bench is a structure-first benchmark that evaluates scientific diagram generation directly from pixel-level outputs. It scores models by structural recoverability rather than visual similarity alone, using a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning.

Abstract

Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
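
The comparison step of the round-trip protocol can be illustrated with a minimal sketch. It assumes the inverse parser has already turned a generated image into a set of labeled nodes and directed edges, and it scores the result against the ground-truth graph with set-based F1 over normalized labels. The `Graph` type, the `normalize` rule, and the F1 scoring are illustrative assumptions, not the metrics defined by SciFlow-Bench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """Hypothetical graph container: labels only, no layout information."""
    nodes: frozenset  # normalized node labels
    edges: frozenset  # (source_label, target_label) pairs


def normalize(label: str) -> str:
    # Assumed matching rule: case-insensitive, whitespace-collapsed exact match.
    return " ".join(label.lower().split())


def make_graph(nodes, edges) -> Graph:
    return Graph(
        nodes=frozenset(normalize(n) for n in nodes),
        edges=frozenset((normalize(s), normalize(t)) for s, t in edges),
    )


def f1(pred: frozenset, gold: frozenset) -> float:
    # Two empty sets agree perfectly; no overlap scores zero.
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def structural_scores(pred: Graph, gold: Graph) -> dict:
    return {
        "node_f1": f1(pred.nodes, gold.nodes),
        "edge_f1": f1(pred.edges, gold.edges),
    }


# Ground-truth graph from a source figure (toy example).
gold = make_graph(
    ["Encoder", "Fusion", "Decoder"],
    [("Encoder", "Fusion"), ("Fusion", "Decoder")],
)
# Graph recovered from a generated image: all boxes present, one arrow misrouted.
pred = make_graph(
    ["encoder", "Fusion ", "Decoder"],
    [("Encoder", "Fusion"), ("Encoder", "Decoder")],
)
scores = structural_scores(pred, gold)
print(scores)
```

In this toy case every node is recovered but one of the two edges is wrong, so the node score is perfect while the edge score is halved. An image-level similarity metric could easily miss that kind of error, which is the motivation for comparing graphs rather than pixels.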

Paper Structure (59 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Motivation and comparison between existing diagram benchmarks and SciFlow-Bench. Left: Image-centric evaluation may assign high scores to visually plausible diagrams that contain structural errors. Right: SciFlow-Bench adopts a structure-first evaluation by inverse-parsing generated diagrams into graphs and measuring structural recoverability.
  • Figure 2: Overview of the SciFlow-Bench framework. A unified round-trip evaluation pipeline based on a hierarchical multi-agent system that constructs canonical ground-truth graphs from source framework figures and recovers predicted graphs from generated diagram images. By inverse-parsing pixel-level outputs into structured graphs, the same pipeline supports both dataset construction and structure-first evaluation via structural recoverability.
  • Figure 3: Benchmark statistics of SciFlow-Bench. (a) Domain distribution across research areas. (b) Distribution of structural difficulty levels defined by graph size and relational complexity. (c) Long-tailed distributions of node and edge counts, reflecting substantial structural complexity in real-world scientific diagrams.
  • Figure 4: Human verification interface and representative annotation outcomes. Annotators refine the automatically extracted graph by selectively excluding unsupported components and adding missing nodes or relations under a minimal-intervention, identity-consistent editing protocol.
  • Figure 5: Representative discrepancy cases observed during human verification. Examples illustrate common sources of mismatch between automatic extraction and human-verified ground-truth graphs, including ambiguous or implicit connections, text-related perception errors, and structurally ambiguous relations inherent in real scientific diagrams. Figures are reproduced from prior work [qi2025cimflow, chen2025structured, zhang2025occupancy].
  • ...and 1 more figure