Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding

Junyi Ye, Ankan Dash, Wenpeng Yin, Guiling Wang

TL;DR

The paper tackles the limited controllability and explainability of end-to-end flowchart understanding by introducing TextFlow, a dual-stage framework that first converts flowchart images into textual representations (Vision Textualizer) and then reasons over that text (Textual Reasoner). By supporting multiple textual formats (Graphviz, Mermaid, PlantUML) and enabling executable graph tools, TextFlow improves performance and interpretability, achieving state-of-the-art results on FlowVQA and FlowLearn (e.g., $82.74\%$ vs. $76.61\%$ for baselines). Through extensive analysis, the authors show that Graphviz is the most effective representation, demonstrate robustness across sources, orientations, and sizes, and reveal that errors stem primarily from the extraction stage rather than from reasoning. The work highlights the value of modular, controllable pipelines that leverage powerful LLMs while maintaining explainability, with practical implications for scalable and transparent flowchart understanding in real-world applications.

Abstract

Flowcharts are typically presented as images, driving the trend of using vision-language models (VLMs) for end-to-end flowchart understanding. However, two key challenges arise: (i) Limited controllability--users have minimal influence over the downstream task, as they can only modify input images, while the training of VLMs is often out of reach for most researchers. (ii) Lack of explainability--it is difficult to trace VLM errors to specific causes, such as failures in visual encoding or reasoning. We propose TextFlow, which addresses the aforementioned issues with two stages: (i) Vision Textualizer--which generates textual representations from flowchart images; and (ii) Textual Reasoner--which performs question-answering based on the text representations. TextFlow offers three key advantages: (i) users can select the type of text representation (e.g., Graphviz, Mermaid, PlantUML), or further convert it into an executable graph object to call tools, enhancing performance and controllability; (ii) it improves explainability by helping to attribute errors more clearly to visual or textual processing components; and (iii) it promotes the modularization of the solution, such as allowing advanced LLMs to be used in the Reasoner stage when VLMs underperform in an end-to-end fashion. Experiments on the FlowVQA and FlowLearn benchmarks demonstrate TextFlow's state-of-the-art performance as well as its robustness. All code is publicly available.
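
To illustrate what converting a text representation into an executable graph object could look like, here is a minimal Python sketch. It is not the authors' implementation: the Graphviz snippet, the node names, and the helpers `parse_edges` and `is_reachable` are hypothetical, and the parser only handles plain `a -> b` edge statements rather than the full DOT grammar.

```python
import re
from collections import deque

# Hypothetical Graphviz (DOT) text, standing in for Vision Textualizer output.
DOT_TEXT = """
digraph flowchart {
    start -> check_input;
    check_input -> process [label="yes"];
    check_input -> report_error [label="no"];
    process -> end;
    report_error -> end;
}
"""

# Matches simple `source -> target` edges; a deliberate simplification of DOT.
EDGE_PATTERN = re.compile(r"(\w+)\s*->\s*(\w+)")


def parse_edges(dot_text):
    """Build an adjacency map from simple `a -> b` edge statements."""
    graph = {}
    for src, dst in EDGE_PATTERN.findall(dot_text):
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])  # make sure sink nodes are registered too
    return graph


def is_reachable(graph, source, target):
    """Breadth-first search: can `target` be reached from `source`?"""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


graph = parse_edges(DOT_TEXT)
node_count = len(graph)
edge_count = sum(len(targets) for targets in graph.values())
```

With the graph in this form, structural questions such as "how many nodes are there?" or "is there a path from `start` to `end`?" become deterministic lookups (`node_count`, `is_reachable(graph, "start", "end")`) instead of free-form model reasoning, which is the kind of controllability the abstract describes.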

Paper Structure

This paper contains 35 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Our dual-stage TextFlow vs. prior work.
  • Figure 2: Comparison of GPT-4o’s performance on Top-Down and Bottom-Up flowchart configurations.
  • Figure 3: Accuracy comparison by node count across various models with a rolling average. VQA and the top 5 performing Reasoners on TextFlow using extracted Mermaid in Table \ref{tab:textrepresentation} are compared.
  • Figure 4: Error analysis for percentage of errors attributed to each category in Claude 3.5 Sonnet.