SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Tong Zhang; Honglin Lin; Zhou Liu; Chong Chen; Wentao Zhang

TL;DR

SciFlow-Bench is a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. It scores generated diagrams by structural recoverability rather than visual similarity alone, using a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning.

Abstract

Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
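
The comparison step at the end of the round-trip protocol can be illustrated with a minimal sketch. It assumes both the ground-truth graph and the inverse-parsed graph are available as node labels and directed edges, and it scores them with set-based F1 under exact label matching. The `Graph` type, the `normalize` rule, and the F1 scoring are illustrative assumptions, not the benchmark's actual metrics or matching procedure, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    nodes: frozenset  # normalized node labels
    edges: frozenset  # directed (source, target) label pairs


def normalize(label: str) -> str:
    # Assumption: labels match if equal after lowercasing and collapsing whitespace.
    return " ".join(label.lower().split())


def make_graph(nodes, edges) -> Graph:
    return Graph(
        nodes=frozenset(normalize(n) for n in nodes),
        edges=frozenset((normalize(s), normalize(t)) for s, t in edges),
    )


def f1(pred: frozenset, gold: frozenset) -> float:
    # Set-based F1; two empty sets count as a perfect match.
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def structural_scores(pred: Graph, gold: Graph) -> dict:
    # Hypothetical scoring: node-level and edge-level F1 reported separately.
    return {
        "node_f1": f1(pred.nodes, gold.nodes),
        "edge_f1": f1(pred.edges, gold.edges),
    }


# Toy example: all nodes are recovered, but one edge direction is reversed.
gold = make_graph(
    ["Encoder", "Decoder", "Loss"],
    [("Encoder", "Decoder"), ("Decoder", "Loss")],
)
pred = make_graph(
    ["encoder", "Decoder", "Loss"],
    [("Encoder", "Decoder"), ("Loss", "Decoder")],
)
scores = structural_scores(pred, gold)
print(scores)
```

In this toy case the generated diagram contains every expected component, so the node score is perfect, while the reversed arrow halves the edge score. A purely image-centric metric could miss this kind of error.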

Paper Structure (59 sections, 6 figures, 4 tables)

Introduction
Related Work
SciFlow-Bench Framework
Problem Formulation and Overview
Hierarchical Multi-Agent Pipeline
Structure-Aware Evaluation
Dataset and Validation
Data Collection
Annotation Quality and Reliability
Benchmark Statistics and Coverage
Comparison with Prior Benchmarks
Experiments
Experimental Setup
Main Results
Structural Performance Regimes
...and 44 more sections

Figures (6)

Figure 1: Motivation and comparison between existing diagram benchmarks and SciFlow-Bench. Left: Image-centric evaluation may assign high scores to visually plausible diagrams that contain structural errors. Right: SciFlow-Bench adopts a structure-first evaluation by inverse-parsing generated diagrams into graphs and measuring structural recoverability.
Figure 2: Overview of the SciFlow-Bench framework. A unified round-trip evaluation pipeline based on a hierarchical multi-agent system that constructs canonical ground-truth graphs from source framework figures and recovers predicted graphs from generated diagram images. By inverse-parsing pixel-level outputs into structured graphs, the same pipeline supports both dataset construction and structure-first evaluation via structural recoverability.
Figure 3: Benchmark statistics of SciFlow-Bench. (a) Domain distribution across research areas. (b) Distribution of structural difficulty levels defined by graph size and relational complexity. (c) Long-tailed distributions of node and edge counts, reflecting substantial structural complexity in real-world scientific diagrams.
Figure 4: Human verification interface and representative annotation outcomes. Annotators refine the automatically extracted graph by selectively excluding unsupported components and adding missing nodes or relations under a minimal-intervention, identity-consistent editing protocol.
Figure 5: Representative discrepancy cases observed during human verification. Examples illustrate common sources of mismatch between automatic extraction and human-verified ground-truth graphs, including ambiguous or implicit connections, text-related perception errors, and structurally ambiguous relations inherent in real scientific diagrams. Figures are reproduced from prior work (qi2025cimflow; chen2025structured; zhang2025occupancy).
...and 1 more figure
