Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities

Hanbin Wang; Xiaoxuan Zhou; Zhipeng Xu; Keyuan Cheng; Yuxin Zuo; Kai Tian; Jingwei Song; Junting Lu; Wenhui Hu; Xueyang Liu

TL;DR

Code-Vision is a vision-centric benchmark that requires multimodal LLMs (MLLMs) to translate flowcharts into correct programs, covering basic programming, algorithms, and math. It comprises three subsets (HumanEval-V, Algorithm, MATH), built with a Mermaid-based flowchart pipeline and rigorous test cases. The study reveals a substantial gap between proprietary and open-source models, especially on hard problems. The results indicate that visual reasoning about logic is a distinct challenge not fully captured by existing benchmarks such as MMCode or MathVista, and that Mermaid representations make the flowcharts easier for many models to comprehend. The released benchmark, data, and code support rigorous assessment of multimodal reasoning and code generation, and can guide future improvements to open-source models.

Abstract

This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. We evaluate 12 MLLMs on Code-Vision and find a large performance gap between proprietary and open-source models. On Hard problems, GPT-4o achieves 79.3% pass@1, whereas the best open-source model achieves only 15%. Further experiments reveal that Code-Vision can pose unique challenges compared to other multimodal reasoning benchmarks such as MMCode and MathVista. We also explore the reasons for the poor performance of the open-source models. All data and code are available at https://github.com/wanghanbinpanda/CodeVision.
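The pass@1 figures quoted above can be read with the following minimal sketch in mind. It uses the unbiased pass@k estimator that is common in code-generation benchmarks; the abstract does not state which estimator or how many samples per problem Code-Vision uses, so the function names and the sample counts here are illustrative assumptions, not the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of programs sampled for the problem
    c: number of those programs that pass every test case
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples
    # contains no passing program, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: one sample per problem, two of four problems solved.
score = benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 0)], k=1)
```

With a single sample per problem (n = 1, k = 1), the estimator reduces to the fraction of problems whose generated program passes all test cases.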
