See it. Say it. Sorted: Agentic System for Compositional Diagram Generation
Hantao Zhang, Jingyang Liu, Ed Li
TL;DR
Diffusion-based methods struggle to preserve structure and symbolic layout in sketch-to-diagram generation. The authors present See it. Say it. Sorted., a training-free agentic system in which a Vision-Language Model (VLM) acts as critic and multiple Large Language Models (LLMs) iteratively generate SVG program updates, from which the best is selected, preserving layout and connectivity. The key contributions are the Critic–Candidates–Judge loop, qualitative relational reasoning, and editable vector-graphics output that can be connected to real-world tools via APIs. On 10 flowchart-inspired sketches, the system shows better structural fidelity than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), composing primitives accurately and avoiding extraneous text. The approach offers a scalable, adaptable foundation for compositional graphics design in practical workflows, with potential extensions to 3D and CAD tasks as models advance.
Abstract
We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative to aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.
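The Critic, Candidates, Judge loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `refine`, the `STRATEGIES` labels, the stopping rules, and the choice to keep the current SVG in the judge's pool (one possible way to get "stable improvement") are all hypothetical. The `toy_*` functions are string-matching stand-ins for the VLM and LLM calls so that the control flow can be run without any model.

```python
from typing import Callable, List, Sequence

# Hypothetical strategy labels, loosely following the abstract's wording.
STRATEGIES = ("conservative", "aggressive", "alternative", "focused")

# Assumed interfaces; in a real system each would wrap a model call.
Critic = Callable[[str, str], List[str]]          # (sketch, svg) -> qualitative edits
Candidate = Callable[[str, List[str], str], str]  # (svg, edits, strategy) -> new svg
Judge = Callable[[str, Sequence[str]], int]       # (sketch, svgs) -> index of best svg


def refine(sketch: str, svg: str, critic: Critic, candidate: Candidate,
           judge: Judge, max_iters: int = 5) -> str:
    """Iteratively improve an SVG program (assumed loop structure)."""
    for _ in range(max_iters):
        edits = critic(sketch, svg)
        if not edits:  # the critic has nothing left to suggest
            break
        proposals = [candidate(svg, edits, s) for s in STRATEGIES]
        # Assumption: the current SVG stays in the pool, so the judge can
        # reject every proposal and the result never gets worse.
        pool = [svg] + proposals
        best = pool[judge(sketch, pool)]
        if best == svg:  # no proposal beat the current program
            break
        svg = best
    return svg


# Toy stand-ins: the "sketch" is a comma-separated list of required tags.
def toy_critic(sketch: str, svg: str) -> List[str]:
    # Report at most two missing element kinds as qualitative instructions.
    missing = [tag for tag in sketch.split(",") if f"<{tag}" not in svg]
    return [f"add {tag}" for tag in missing[:2]]


def toy_candidate(svg: str, edits: List[str], strategy: str) -> str:
    if strategy == "aggressive":
        chosen = edits        # apply every suggested edit
    elif strategy == "alternative":
        chosen = edits[-1:]   # apply only the last edit
    else:
        chosen = edits[:1]    # conservative / focused: first edit only
    body = "".join(f"<{edit.split()[1]}/>" for edit in chosen)
    return svg.replace("</svg>", body + "</svg>")


def toy_judge(sketch: str, pool: Sequence[str]) -> int:
    # Score each SVG by how many required tags it contains; ties go to
    # the earliest entry, which favors the current SVG at index 0.
    wanted = sketch.split(",")
    scores = [sum(f"<{tag}" in svg for tag in wanted) for svg in pool]
    return scores.index(max(scores))


if __name__ == "__main__":
    print(refine("rect,circle,path", "<svg></svg>",
                 toy_critic, toy_candidate, toy_judge))
```

In this sketch the critic emits relational instructions ("add rect") rather than coordinates, which mirrors the abstract's emphasis on qualitative edits over numerical estimates. A human-in-the-loop correction could be modeled by appending a user-written instruction to `edits` before the candidates run.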
