See it. Say it. Sorted: Agentic System for Compositional Diagram Generation

Hantao Zhang, Jingyang Liu, Ed Li

TL;DR

Sketch-to-diagram generation is hard for diffusion-based methods, which struggle to preserve structure and symbolic layout. The authors present See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model, acting as Critic and Judge, with multiple Large Language Models to iteratively generate and select SVG program updates while preserving layout and connectivity. The key contributions are the Critic–Candidates–Judge loop, qualitative relational reasoning, and editable vector-graphics output that can be passed to real-world tools via APIs. On 10 flowchart-inspired sketches, the system shows better structural fidelity than two frontier image-generation LLMs (GPT-5 and Gemini-2.5-Pro), composing primitives accurately and avoiding extraneous text. The approach offers an adaptable foundation for compositional graphics design in practical workflows, with potential extensions to 3D and CAD tasks as models advance.

Abstract

We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative->aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.

Paper Structure

This paper contains 10 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Given a sketch of a flow chart, our agent faithfully reconstructs the intended structure while strictly following the accompanying text instructions. The system operates in an iterative Critic–Candidates–Judge loop: a VLM critiques discrepancies between the sketch and the current diagram, multiple LLMs propose diverse SVG modifications, and a Judge VLM selects the best candidate. This training-free framework enables accurate, controllable, and editable diagram generation, moving beyond pixel-level synthesis toward structured programmatic outputs.
  • Figure 2: Pipeline of See it. Say it. Sorted. for one optimization step. The Critic VLM compares the target sketch with the current image, identifies a few small modifications, and passes them to the LLM. Given the VLM's instructions, the LLM generates several candidates that balance the exploration-exploitation trade-off. These candidates are rendered and evaluated together with the current image by the Judge VLM, which decides which one best reconstructs the sketch. If the current image is chosen, the step is reverted and the Critic VLM receives feedback about the failed modifications. If one of the candidates is chosen, the loop proceeds to the next optimization step.
  • Figure 3: Our agent faithfully generates flow charts based on the sketch within 3 optimization steps, significantly outperforming GPT-5 and Gemini-2.5-Pro at preserving the structure and characteristics of the diagram. For the complete comparison across all 10 tasks, see the comparison table in the Appendix (tab:comparison_1).
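
The single optimization step described in the Figure 2 caption can be sketched as a small control loop. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the `optimize` function, the `Critic`, `Proposer`, and `Judge` signatures, and the toy stand-ins are all hypothetical. In the real system the critic and judge would be VLM calls operating on rendered images, and the proposers would be LLM calls with different prompting strategies. Here plain functions on strings play those roles so that the accept-or-revert logic can be run directly.

```python
from typing import Callable, List, Sequence

# Hypothetical role signatures (assumptions for illustration, not the paper's API).
Critic = Callable[[str, str, List[str]], List[str]]  # (sketch, svg, failed edits) -> edits
Proposer = Callable[[str, List[str]], str]           # (svg, edits) -> candidate svg
Judge = Callable[[str, Sequence[str]], int]          # (sketch, options) -> index of best


def optimize(sketch: str, initial_svg: str, critic: Critic,
             proposers: Sequence[Proposer], judge: Judge,
             max_steps: int = 3) -> str:
    """Run a Critic-Candidates-Judge loop; option 0 is always the current SVG."""
    current = initial_svg
    failed: List[str] = []  # edits whose candidates the judge rejected
    for _ in range(max_steps):
        edits = critic(sketch, current, failed)
        if not edits:  # critic sees nothing left to fix
            break
        candidates = [propose(current, edits) for propose in proposers]
        options = [current] + candidates
        choice = judge(sketch, options)
        if choice == 0:
            failed.extend(edits)  # revert: keep current, report the failure
        else:
            current = options[choice]
            failed = []
    return current


# Toy stand-ins: the "sketch" is a comma-separated list of required shapes.
def toy_critic(sketch: str, svg: str, failed: List[str]) -> List[str]:
    missing = [s for s in sketch.split(",") if f"<{s}/>" not in svg]
    return missing[:2]  # propose only a small set of edits per step


def add_shapes(svg: str, shapes: List[str]) -> str:
    for shape in shapes:
        svg = svg.replace("</svg>", f"<{shape}/></svg>")
    return svg


def conservative(svg: str, edits: List[str]) -> str:
    return add_shapes(svg, edits[:1])  # apply only the first edit


def aggressive(svg: str, edits: List[str]) -> str:
    return add_shapes(svg, edits)  # apply every edit at once


def toy_judge(sketch: str, options: Sequence[str]) -> int:
    def score(svg: str) -> int:
        return sum(f"<{s}/>" in svg for s in sketch.split(","))
    # max() returns the first best index, so ties favor the current SVG.
    return max(range(len(options)), key=lambda i: score(options[i]))


result = optimize("rect,circle,line", "<svg></svg>",
                  toy_critic, [conservative, aggressive], toy_judge)
print(result)  # <svg><rect/><circle/><line/></svg>
```

Keeping the current diagram as option 0 means a step can never make the result worse by the judge's own criterion, which is one way to read the "stable improvement" property claimed in the abstract. The `failed` list mirrors the feedback the Critic VLM receives after a reverted step, although the toy critic ignores it.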