First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models

Enming Zhang; Ruobing Yao; Huanyong Liu; Junhui Yu; Jiale Wang

TL;DR

This work proposes FlowCE, the first comprehensive method for assessing MLLMs on flowchart-related tasks. It evaluates MLLMs' abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts.

Abstract

As Multimodal Large Language Model (MLLM) technology develops, the general capabilities of these models grow increasingly powerful, and numerous evaluation systems have emerged to assess their various abilities. However, there is still no comprehensive method for evaluating MLLMs on flowchart-related tasks, even though flowcharts are very important in daily life and work. We propose the first comprehensive method, FlowCE, to assess MLLMs across multiple dimensions of flowchart-related tasks. It evaluates MLLMs' abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts. We find that even GPT-4o achieves a score of only 56.63. Among open-source models, Phi-3-Vision obtains the highest score, 49.97. We hope that FlowCE can contribute to future research on MLLMs for flowchart-based tasks. Repository: https://github.com/360AILABNLP/FlowCE

Paper Structure (24 sections, 3 equations, 17 figures, 9 tables)

Introduction
Related Work
Multimodal Large Language Models
Benchmarks for MLLMs
FlowCE
Tasks across different dimensions
Data construction
Evaluation method
Experiments
Experimental setups
Evaluation results
Further analysis
Model parameter volume
Model data volume
Consensus between Humans and Evaluators
...and 9 more sections

Figures (17)

Figure 1: Evaluation results of multimodal large language models on five dimensions of tasks in FlowCE. GPT-4o achieves the highest overall score of 56.63.
Figure 2: The process of creating and evaluating FlowCE.
Figure 3: Data samples of FlowCE, which covers 5 evaluation dimensions. Each evaluation dimension contains human-annotated question-answer pairs.
Figure 4: Flowchart Type Distribution Across Varied Categories.
Figure 5: Resolution Distribution in Flowchart Representation.
...and 12 more figures
