Table of Contents
Fetching ...

Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation

Jiawei Zhou, Chi Zhang, Xiang Feng, Qiming Zhang, Haibo Qiu, Lihuo He, Dengpan Ye, Xinbo Gao, Jing Zhang

Abstract

We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception -- to parse intricate spatial hierarchies and symbolic details -- and precise generative expression -- to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content -- from scientific visualizations to complex symbolic notations -- each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at https://github.com/MiliLab/Omni-I2C.

Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation

Abstract

We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception -- to parse intricate spatial hierarchies and symbolic details -- and precise generative expression -- to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content -- from scientific visualizations to complex symbolic notations -- each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at https://github.com/MiliLab/Omni-I2C.
Paper Structure (56 sections, 3 equations, 26 figures, 11 tables)

This paper contains 56 sections, 3 equations, 26 figures, 11 tables.

Figures (26)

  • Figure 1: Taxonomy overview of the Omni-I2C benchmark. The dataset features a diverse distribution across 5 code types, 8 major subjects, and 45 distinct figure types.
  • Figure 2: The Data Curation and pipeline of Omni-I2C. Data Curation (Left): From raw web collection to a refined benchmark with diverse themes and languages, ensured by strict filtering and human verification. Pipeline (Right): The proposed workflow for image-to-code generation and automatic evaluation.
  • Figure 3: Representative examples from Gemini 3 Pro.
  • Figure 4: Comparative Analysis of Error Patterns in Claude Sonnet 4.5, Gemini 3 Pro, and Qwen3-VL-235B-A22B-Instruct. Case studies and evaluations across various languages are detailed in the App. \ref{['sec:appendix_err']}.
  • Figure 5: Examples of LLM-based code restructuring with foreground and numerical perturbations.
  • ...and 21 more figures