
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning

Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, Yunqing Zhao

TL;DR

CodeDance reframes visual reasoning as executable, code-driven planning that dynamically selects and composes tools via a sandboxed interpreter. It combines a 34K high-quality trajectory dataset with a difficulty-adaptive reward and GRPO-based RL to encourage balanced tool usage and robust, multi-turn reasoning. Across visual reasoning, counting, chart QA, and math benchmarks, CodeDance outperforms schema-based baselines and several larger models, and exhibits emergent behaviors such as novel tool compositions and cross-task transfer. The work demonstrates that executable code can serve as a scalable, verifiable medium for multimodal AI, while highlighting considerations for safety and real-world deployment.
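To make the idea of a difficulty-adaptive reward combined with GRPO more concrete, here is a minimal sketch. The reward shape is an assumption for illustration only: this summary does not give the paper's actual formula, and the names `adaptive_rewards`, `bonus`, and `penalty` are hypothetical. The only part taken from standard GRPO is the group-relative normalization, where each rollout's reward is standardized against the other rollouts sampled for the same prompt.

```python
from statistics import mean, pstdev


def adaptive_rewards(correct, tool_calls, bonus=0.5, penalty=0.1):
    """Hypothetical difficulty-adaptive reward for one group of rollouts.

    correct:    list of bools, one per rollout sampled for the same prompt
    tool_calls: list of ints, number of tool calls made by each rollout
    """
    # Difficulty proxy (an assumption): share of rollouts in the group that failed.
    difficulty = 1.0 - mean([1.0 if ok else 0.0 for ok in correct])
    rewards = []
    for ok, n_calls in zip(correct, tool_calls):
        reward = 1.0 if ok else 0.0
        if ok and n_calls > 0:
            # Tool use on a correct answer earns more on harder prompts.
            reward += bonus * difficulty
        # Each tool call costs more on easier prompts, discouraging overuse.
        reward -= penalty * (1.0 - difficulty) * n_calls
        rewards.append(reward)
    return rewards


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: standardize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Easy prompt: every rollout is correct, so extra tool calls only lose reward.
easy = adaptive_rewards([True, True, True, True], [0, 3, 0, 1])

# Hard prompt: only the tool-using rollout is correct, so it earns the bonus.
hard = adaptive_rewards([False, True, False, False], [0, 2, 0, 1])
hard_adv = group_relative_advantages(hard)
```

Under these assumptions, the tool-free rollouts score highest on the easy prompt, while the single correct tool-using rollout receives the largest advantage on the hard prompt. This matches the balance the TL;DR describes only qualitatively.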

Abstract

Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool calling, which trades off exploration against efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.
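The contrast between a fixed-schema call and executable code can be illustrated with a toy sketch. Everything here is hypothetical and is not CodeDance's actual tool set or interpreter: the image is a small grid of 0/1 cells, `crop` and `count_objects` stand in for real visual tools, and `run_program` executes a model-written snippet with Python's built-in `exec`. Note that `exec` with a trimmed namespace is not a real sandbox; a production system would need proper isolation.

```python
def crop(image, box):
    """Return the sub-grid inside box = (left, top, right, bottom), end-exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def count_objects(image):
    """Toy stand-in for a detector: count the non-zero cells."""
    return sum(1 for row in image for cell in row if cell)


# Hypothetical tool registry exposed to model-written code.
TOOLS = {"crop": crop, "count_objects": count_objects}


def run_program(source, image):
    """Execute a model-written snippet and return its `result` variable.

    Illustration only: emptying __builtins__ does NOT make exec safe.
    """
    namespace = {"__builtins__": {}, "image": image, **TOOLS}
    exec(source, namespace)
    return namespace.get("result")


image = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
]

# A fixed schema could only return one box. Code can compose tools and
# compute an intermediate result per region before answering.
program = """
left = crop(image, (0, 0, 2, 3))
right = crop(image, (2, 0, 4, 3))
result = {"left": count_objects(left), "right": count_objects(right)}
"""

answer = run_program(program, image)
print(answer)  # {'left': 3, 'right': 2}
```

The intermediate values (`left`, `right`) remain inspectable after execution, which is the sense in which code-based reasoning is described as self-checkable.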

Paper Structure

This paper contains 30 sections, 11 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Motivation and effectiveness of dynamic tool invocation in CodeDance. Left: qualitative examples show that both tool underuse (unable to invoke tools on challenging tasks) and tool overuse (redundant calls on easy tasks) lead to hallucinated reasoning, incorrect answers, and unnecessary complexity (more reasoning turns and longer rollout time), whereas CodeDance dynamically invokes tools according to task difficulty to obtain correct solutions. DeepEyes denotes the model optimized using the same reward as DeepEyes. Right: quantitative results show CodeDance-7B consistently surpasses all Qwen2.5-VL 7B baselines and exceeds the 32B version on several tasks.
  • Figure 2: Overview of our framework that enables executable visual reasoning and invokes tool integration adaptively.
  • Figure 3: Intriguing reasoning trajectories emerge during RL. These behaviors are absent from the SFT data and arise from pretrained knowledge further shaped by RL, reflecting our design of adaptive tool invocation. These emergent patterns motivate our scaling study in Figure 4, where we further examine whether these capabilities strengthen with larger data, longer training, and bigger models.
  • Figure 4: Scaling up the compute budget along four dimensions: dataset size for SFT, model capacity, maximum turns during inference, and RL steps.
  • Figure 5: Entropy and validation accuracy of the model's generations.
  • ...and 9 more figures