CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo, Zijian Hu, Ruilin Luo, Ruize Chen, Songtao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang

TL;DR

This work constructs ICC-1M, a large-scale dataset of 1M Image-Caption-Code triplets that realizes a code-as-perception paradigm, and introduces STEM2Code-Eval, a benchmark that directly evaluates visual perception in STEM domains.

Abstract

When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium: executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.
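
To make the triplet structure concrete, the following is a minimal sketch of how one Image-Caption-Code record and a basic executability check might look. The class name `ICCTriplet`, its field names, and the helpers `runs_cleanly` and `executable_rate` are hypothetical and are not taken from the released code; the sketch only tests whether code runs, whereas the benchmark described above also concerns how well the reconstructed image matches, which is not modeled here.

```python
from dataclasses import dataclass
import contextlib
import io


@dataclass(frozen=True)
class ICCTriplet:
    """One Image-Caption-Code record (field names are hypothetical)."""
    image_path: str  # path to the rendered STEM image
    caption: str     # caption grounded in the code
    code: str        # Python source intended to reproduce the image


def runs_cleanly(source: str) -> bool:
    """Return True if `source` compiles and executes without raising.

    Illustration only: exec() provides no sandboxing, so untrusted
    model output should be run in an isolated process instead.
    """
    try:
        compiled = compile(source, "<generated>", "exec")
        # Swallow anything the snippet prints so the check stays quiet.
        with contextlib.redirect_stdout(io.StringIO()):
            exec(compiled, {"__name__": "__generated__"})
    except Exception:
        return False
    return True


def executable_rate(triplets: list[ICCTriplet]) -> float:
    """Fraction of triplets whose code executes successfully."""
    if not triplets:
        return 0.0
    return sum(runs_cleanly(t.code) for t in triplets) / len(triplets)
```

A check like this is deterministic for deterministic snippets, which is one reason executable code may be attractive as an evaluation target compared with free-form captions.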

Paper Structure (23 sections, 12 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: The scaling analysis reveals perception as the bottleneck in STEM. We decouple visual STEM reasoning into perception (image-to-caption) and reasoning (caption-to-answer) stages, then independently scale each component while keeping the other constant. Left: (Blue) Perception@4B + Reasoning@4/8/32B-scaled; (Red) Reasoning@4B + Perception@4/8/32B-scaled. Right: (Blue) Perception@8B + Reasoning@4/8/32B-scaled; (Red) Reasoning@8B + Perception@4/8/32B-scaled. All components use Qwen3-VL-Thinking models [Qwen3VL_github] and are evaluated on the representative MathVision dataset [MathVision]. Both experiments demonstrate that scaling perception consistently outperforms scaling reasoning. This finding motivates our focus on systematically enhancing MLLMs' perception capabilities in STEM.
  • Figure 2: Overview of the CodePercept pipeline, which enhances MLLMs' visual perception in STEM domains through code-grounded learning. (Part 01) Starting from public STEM data, we construct high-quality image-code pairs via three complementary pipelines: (1) Image Reproduce converts existing STEM images into executable Python code, (2) Image Diversity extracts concepts from seed images and generates diverse instantiations while preserving semantic validity, and (3) Solid Geometry employs parametric templates to generate complex solid geometry images with corresponding code, addressing MLLMs' solid geometry limitations. (Part 02) The synthesized data enables two novel training tasks (Code-Grounded Caption Generation and STEM Image-to-Code Translation) that fundamentally shift how we approach visual perception. (Part 03) These processes culminate in ICC-1M, a dataset of over 1M curated image-caption-code triplets. We employ both Supervised Finetuning and Reinforcement Learning to train models that achieve robust visual perception capabilities.
  • Figure 3: The hexagonal grid with a spiraling path, generated by the Python code below.
  • Figure 4: A representative image that poses a significant challenge to the perceptual abilities of current MLLMs.
  • Figure 5: Training curves of our proposed models. (a) Curves of the CodePercept-S1 models. (b) Curves of the CodePercept-R1 models.
  • ...and 5 more figures
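
The decoupled setup described in the Figure 1 caption can be sketched as a two-stage pipeline in which a perception model turns the image into a caption and a reasoning model answers from the caption alone. The sketch below is an illustration under stated assumptions: `two_stage_accuracy` and `scaling_configs` are hypothetical names, the models are stand-in callables rather than real Qwen3-VL calls, and exact string match is an assumed scoring rule that the source does not specify.

```python
from typing import Callable, Sequence, Tuple

Perceiver = Callable[[str], str]      # image path -> caption
Reasoner = Callable[[str, str], str]  # (caption, question) -> answer
Sample = Tuple[str, str, str]         # (image path, question, gold answer)


def two_stage_accuracy(
    perceive: Perceiver, reason: Reasoner, samples: Sequence[Sample]
) -> float:
    """Accuracy when the reasoner sees only the caption, never the image."""
    if not samples:
        return 0.0
    correct = 0
    for image, question, gold in samples:
        caption = perceive(image)            # perception stage
        answer = reason(caption, question)   # reasoning stage
        correct += answer.strip() == gold.strip()  # assumed exact-match scoring
    return correct / len(samples)


def scaling_configs(fixed: str, sizes: Sequence[str]):
    """Return (perception_size, reasoning_size) pairs for both curves.

    The first list holds perception fixed and scales reasoning (blue curve);
    the second holds reasoning fixed and scales perception (red curve).
    """
    scale_reasoning = [(fixed, size) for size in sizes]
    scale_perception = [(size, fixed) for size in sizes]
    return scale_reasoning, scale_perception
```

For the left panel, `scaling_configs("4B", ["4B", "8B", "32B"])` would enumerate the six runs (the two curves share the 4B/4B point); each pair is then mapped to concrete models and scored with `two_stage_accuracy`.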