Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities

Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu

TL;DR

Code-Vision presents a visual-centric benchmark that requires Multimodal LLMs to translate flowcharts into correct programs, with tasks spanning basic programming, algorithms, and math. By constructing three datasets (HumanEval-V, Algorithm, MATH) and using a Mermaid-based flowchart pipeline with rigorous test cases, the study reveals a substantial gap between proprietary and open-source systems, especially on hard problems. The results indicate that visual reasoning about logic is a distinct challenge not fully captured by existing benchmarks such as MMCode or MathVista, and that Mermaid representations are easier for many systems to comprehend. The benchmark and its accompanying data and code enable rigorous assessment of multimodal reasoning and code-generation capabilities, guiding future improvements in open-source approaches and in visual reasoning.

Abstract

This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision. The results show a large performance gap between proprietary and open-source models: on Hard problems, GPT-4o achieves 79.3% pass@1, while the best open-source model achieves only 15%. Further experiments reveal that Code-Vision poses unique challenges compared to other multimodal reasoning benchmarks such as MMCode and MathVista. We also explore the reasons for the poor performance of the open-source models. All data and code are available at https://github.com/wanghanbinpanda/CodeVision.
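
The abstract reports results as pass@1. As a minimal sketch of how such a score could be computed, the snippet below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of programs sampled for a problem and c is the number that pass all test cases. This is an assumption for illustration: the summary here does not state how many samples were drawn per problem or which estimator the authors used, and the names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: programs sampled for the problem (assumed, not from the source)
    c: how many of those programs passed every test case
    k: evaluation budget, e.g. k=1 for pass@1
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, any k-subset contains a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each entry is an (n, c) pair."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical outcome: four problems, one sample each, two solved.
    print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 0)], k=1))
```

With a single sample per problem (n = 1, k = 1), the estimator reduces to the fraction of problems whose generated program passes all of its test cases.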

Paper Structure

This paper contains 19 sections, 1 equation, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison of data examples from Code-Vision and MMCode. In MMCode, images serve a supplementary role, while in Code-Vision, images play a primary role.
  • Figure 2: Code-Vision Construction Method. The Method includes Data Collection, Flowchart Construction, and Test Cases Generation.
  • Figure 3: The Prompt Template Used for Flowchart Construction and Test Cases Generation. We provide an example to keep the model's output in the expected format. The example is in Appendix \ref{sec:prompt}.
  • Figure 4: Comparison with the MathVista Benchmark. All MLLMs perform similarly on MathVista but differ widely on Code-Vision. The diff is the performance of the model on Code-Vision minus its performance on MathVista. Detailed results are in Appendix \ref{sec:detail_results}.
  • Figure 5: Error Analysis of Proprietary Models and Open-source Models. We count the proportion of each error type in the code generated by proprietary and open-source models.
  • ...and 1 more figures