An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation
Giang Son Nguyen, Zi Pong Lim, Sarthak Ketanbhai Modi, Yon Shin Teo, Wenya Wang
TL;DR
This work tackles the challenge of evaluating flowchart image-to-code generation in production where no ground-truth code is available. It introduces a reference-free framework built from Recall$_{OCR}$, measuring content coverage via OCR, and Precision$_{VE}$, detecting hallucinations via Visual Entailment, combined into F1$_{OCR-VE}$. The method is model- and language-agnostic, online-capable, and provides interpretable diagnostics to aid error triage, achieving strong alignment with ground-truth metrics on FlowVQA (e.g., $r=0.967$, $0.910$, and $0.939$ for Recall$_{OCR}$, Precision$_{VE}$, and F1$_{OCR-VE}$ respectively). Practically, this framework enables continuous production monitoring and modular upgrades of OCR and VE components to maintain output quality without annotation overhead.
Abstract
Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}_{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}_{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}_{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework's reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.
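To make the way the two metrics combine concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the OCR step has already produced a list of text strings and the Visual Entailment step has already produced one boolean verdict per generated element. The function names (`recall_ocr`, `precision_ve`, `f1_ocr_ve`) and the normalized substring matching rule are hypothetical choices made for illustration; only the harmonic-mean combination is taken from the abstract.

```python
from typing import Iterable, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(text.lower().split())


def recall_ocr(ocr_texts: Iterable[str], generated_code: str) -> float:
    """Fraction of OCR-extracted strings found in the generated code.

    Assumption: a string counts as covered if its normalized form is a
    substring of the normalized code. Short strings can therefore match
    by accident; a real system would likely use a stricter rule.
    """
    targets = [normalize(t) for t in ocr_texts if normalize(t)]
    if not targets:
        return 0.0
    haystack = normalize(generated_code)
    hits = sum(1 for t in targets if t in haystack)
    return hits / len(targets)


def precision_ve(entailed: Sequence[bool]) -> float:
    """Fraction of generated elements judged as entailed by the image.

    Assumption: the VE verdicts are supplied by an external model.
    """
    if not entailed:
        return 0.0
    return sum(entailed) / len(entailed)


def f1_ocr_ve(recall: float, precision: float) -> float:
    """Harmonic mean of the two scores, defined as 0 when both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Hypothetical example: the image contains three text boxes, and the
# generated Mermaid code covers two of them and adds one unsupported node.
ocr_texts = ["Start", "Check input", "End"]
generated = "flowchart TD\n  A[Start] --> B[Check input]\n  B --> C[Retry]"
ve_verdicts = [True, True, False]  # one verdict per generated node

recall = recall_ocr(ocr_texts, generated)
precision = precision_ve(ve_verdicts)
score = f1_ocr_ve(recall, precision)
```

In this toy case the missing "End" node lowers recall and the unsupported "Retry" node lowers precision, so the two component scores indicate which kind of error occurred while the combined score summarizes overall quality.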
