Feynman: Knowledge-Infused Diagramming Agent for Scalable Visual Designs

Zixin Wen, Yifu Cai, Kyle Lee, Sam Estep, Josh Sunshine, Aarti Singh, Yuejie Chi, Wode Ni

Abstract

Visual design is an essential application of state-of-the-art multi-modal AI systems. Improving these systems requires high-quality vision-language data at scale. Despite the abundance of internet image and text data, knowledge-rich and well-aligned image-text pairs are rare. In this paper, we present a scalable diagram generation pipeline built with our agent, Feynman. To create diagrams, Feynman first enumerates domain-specific knowledge components ("ideas") and performs code planning based on the ideas. Given the plan, Feynman translates ideas into simple declarative programs and iterates to receive feedback and visually refine the diagrams. Finally, the declarative programs are rendered by the Penrose diagramming system. The optimization-based rendering of Penrose preserves the visual semantics while injecting fresh randomness into the layout, thereby producing diagrams with visual consistency and diversity. As a result, Feynman can author diagrams along with grounded captions at very little cost and in very little time. Using Feynman, we synthesized a dataset with more than 100k well-aligned diagram-caption pairs. We also curate a visual-language benchmark, Diagramma, from freshly generated data. Diagramma can be used for evaluating the visual reasoning capabilities of vision-language models. We plan to release the dataset, benchmark, and the full agent pipeline as an open-source project.

Paper Structure

This paper contains 53 sections, 5 equations, 16 figures, 4 tables, 2 algorithms.

Figures (16)

  • Figure 1: The Feynman Agent
  • Figure 2: Iterate Step: At each step, Feynman attempts to write a Penrose program to create a diagram. The generated program is then compiled into images and sent to a panel of visual judges (MLLMs) for critical feedback. We term this algorithm Iterative Visual-Refine.
  • Figure 3: Examples of conceptual diagrams and their Substance notations: a graph where node connections form a cube (left) and the Lewis structure of the formaldehyde molecule ($\mathrm{CH_2O}$) (right).
  • Figure 4: Feynman generates programs that Penrose compiles into a layout optimization problem. The Penrose layout engine then solves the optimization problem.
  • Figure 5: Diverse visual layouts of Penrose diagram variations: using the same Substance, Penrose can produce diagram variations while preserving the semantics, by sampling random initial values for shapes, colors, and other numerical quantities in the diagram. We show 4 random seeds for each of 4 Substance programs: (A) ray-tracing diagrams, (B) Cayley graphs, (C) Chaos game as a Sierpinski triangle, and (D) Euler diagrams for sets.
  • ...and 11 more figures
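
The iterate step described in the Figure 2 caption (write a program, compile it, collect feedback from a panel of judges, revise) can be sketched as a simple loop. This is only a minimal illustration under stated assumptions, not the paper's implementation: the names `iterative_visual_refine`, `Feedback`, `write_program`, `render`, and `judges` are hypothetical, the acceptance rule (all judges must accept) and the step budget are assumptions, and the toy stubs stand in for the LLM, the Penrose compiler, and the MLLM judges.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Feedback:
    accept: bool   # hypothetical: whether this judge is satisfied
    comment: str   # hypothetical: critique passed back to the writer


def iterative_visual_refine(
    idea: str,
    write_program: Callable[[str, List[str]], str],
    render: Callable[[str], Optional[bytes]],
    judges: List[Callable[[bytes], Feedback]],
    max_steps: int = 3,  # assumed step budget, not from the paper
) -> Optional[str]:
    """Return the first program every judge accepts, or None if none is found."""
    comments: List[str] = []
    for _ in range(max_steps):
        program = write_program(idea, comments)
        image = render(program)
        if image is None:
            # Treat a failed compile as feedback for the next attempt.
            comments = ["program failed to compile"]
            continue
        verdicts = [judge(image) for judge in judges]
        if all(v.accept for v in verdicts):
            return program
        # Carry only the critical comments into the next revision.
        comments = [v.comment for v in verdicts if not v.accept]
    return None


# Toy stand-ins so the loop can be run end to end.
def toy_writer(idea: str, comments: List[str]) -> str:
    return f"{idea}: v{1 if comments else 0}"


def toy_render(program: str) -> Optional[bytes]:
    return program.encode()


def toy_judge(image: bytes) -> Feedback:
    ok = image.endswith(b"v1")
    return Feedback(ok, "" if ok else "labels overlap")


result = iterative_visual_refine("cube graph", toy_writer, toy_render, [toy_judge])
print(result)
```

Passing the writer, renderer, and judges in as callables keeps the loop independent of any particular model or rendering backend, so each piece can be swapped or mocked separately.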