An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation

Giang Son Nguyen, Zi Pong Lim, Sarthak Ketanbhai Modi, Yon Shin Teo, Wenya Wang

TL;DR

This work tackles the challenge of evaluating flowchart image-to-code generation in production, where no ground-truth code is available. It introduces a reference-free framework built from $\text{Recall}_{\text{OCR}}$, measuring content coverage via OCR, and $\text{Precision}_{\text{VE}}$, detecting hallucinations via Visual Entailment, combined into $\text{F1}_{\text{OCR-VE}}$. The method is model- and language-agnostic, online-capable, and provides interpretable diagnostics to aid error triage, achieving strong alignment with ground-truth metrics on FlowVQA (e.g., $r=0.967$, $0.910$, and $0.939$ for $\text{Recall}_{\text{OCR}}$, $\text{Precision}_{\text{VE}}$, and $\text{F1}_{\text{OCR-VE}}$ respectively). Practically, this framework enables continuous production monitoring and modular upgrades of OCR and VE components to maintain output quality without annotation overhead.

Abstract

Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}_{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}_{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}_{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework's reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.
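
To make the relationship between the three scores concrete, the following is a minimal sketch, not the authors' implementation. It assumes that OCR output is already available as a list of strings, that a VE model has already returned one boolean verdict per generated element, and that coverage is checked by case-insensitive substring matching; the function names and the matching rule are hypothetical and chosen only for illustration. The one part taken directly from the abstract is that the unified score is the harmonic mean of the recall and precision estimates.

```python
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(text.lower().split())


def recall_ocr(ocr_texts: List[str], generated_code: str) -> float:
    """Share of OCR-extracted strings that appear in the generated code.

    Assumption: substring matching after normalization stands in for
    whatever matching procedure the framework actually uses.
    """
    if not ocr_texts:
        return 0.0
    code = normalize(generated_code)
    hits = sum(1 for text in ocr_texts if normalize(text) in code)
    return hits / len(ocr_texts)


def precision_ve(entailed: List[bool]) -> float:
    """Share of generated elements judged as supported by the image.

    Assumption: `entailed` holds one verdict per generated element,
    produced beforehand by some Visual Entailment model.
    """
    if not entailed:
        return 0.0
    return sum(entailed) / len(entailed)


def f1_ocr_ve(recall: float, precision: float) -> float:
    """Harmonic mean of the two reference-free estimates."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Hypothetical example: three text boxes found by OCR, four generated elements.
ocr_texts = ["Start", "Check input", "End"]
mermaid = "flowchart TD\n  A[Start] --> B{Check input}\n  B --> C[Done]"
verdicts = [True, True, False, True]

recall = recall_ocr(ocr_texts, mermaid)
precision = precision_ve(verdicts)
score = f1_ocr_ve(recall, precision)
```

In this toy input, the OCR string "End" has no counterpart in the generated Mermaid code, which lowers the recall estimate, and one generated element is rejected by the assumed VE verdicts, which lowers the precision estimate. Keeping the two components separate is what allows a low score to be traced to missing content or to hallucinated content.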

Paper Structure

This paper contains 32 sections, 5 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Scatter plots of $\text{Precision}_{\text{VE}}$ versus $\text{Precision}_{\text{Actual}}$ across VE models (rows) and VLMs (columns). Color intensity reflects the image-level F1 scores. Gemini 2.5 Pro shows the strongest alignment (highest correlation, lowest RMSE), reflecting its low error rates (FPR and FNR). Claude Sonnet 4.0 exhibits moderate correlation due to consistent overestimation, while Gemini 1.5 Pro shows the weakest correlation, driven by both over- and underestimation tendencies.