CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C. Hollon, Bryan Wang

TL;DR

CodeV introduces Tool-Aware Policy Optimization (TAPO), a process-guided RL framework that rewards faithful, evidence-grounded visual tool use rather than relying solely on final answers. CodeV represents visual tools as Python blocks run in a sandbox and pairs them with dense, step-level rewards for tool usefulness and evidence alignment. With this design it achieves strong performance across perception, visual search, and multimodal reasoning benchmarks while significantly increasing faithfulness. Under a two-stage training pipeline (SFT followed by TAPO-based RL), the accompanying faithfulness evaluation protocol shows that CodeV reduces unfaithful tool use and resists reward hacking, addressing reliability and interpretability concerns in agentic visual reasoning. The work also documents its data and experimental setup, reports ablations, and provides a public release to support reproduction and extension to broader tool ecosystems.

Abstract

Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.
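
The faithfulness measurement described above can be illustrated with a minimal Python sketch. It assumes each trajectory records its crop boxes and that ground-truth target boxes are available; the `Trajectory` fields, the `covers` helper, and the 0.5 overlap threshold are hypothetical choices for illustration, not the paper's protocol, which may instead judge the cropped image itself.

```python
from dataclasses import dataclass

# Boxes are (x0, y0, x1, y1) in pixel coordinates (an assumed convention).
Box = tuple[float, float, float, float]


@dataclass
class Trajectory:
    correct: bool        # whether the final answer was correct
    crops: list[Box]     # regions passed to the crop tool
    targets: list[Box]   # ground-truth boxes of the queried objects


def covers(crop: Box, target: Box, min_overlap: float = 0.5) -> bool:
    """True if at least `min_overlap` of the target's area lies inside the crop."""
    ix0, iy0 = max(crop[0], target[0]), max(crop[1], target[1])
    ix1, iy1 = min(crop[2], target[2]), min(crop[3], target[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (target[2] - target[0]) * (target[3] - target[1])
    return area > 0 and inter / area >= min_overlap


def is_faithful(t: Trajectory) -> bool:
    """A trajectory is faithful if any crop captures any target object."""
    return any(covers(c, g) for c in t.crops for g in t.targets)


def faithful_rate_given_correct(trajs: list[Trajectory]) -> float:
    """Fraction of correct-answer trajectories whose tool use is faithful."""
    correct = [t for t in trajs if t.correct]
    if not correct:
        return 0.0
    return sum(is_faithful(t) for t in correct) / len(correct)


target = (10.0, 10.0, 20.0, 20.0)
trajs = [
    Trajectory(True, [(0.0, 0.0, 100.0, 100.0)], [target]),      # correct, faithful
    Trajectory(True, [(200.0, 200.0, 300.0, 300.0)], [target]),  # correct, wrong region
    Trajectory(True, [], [target]),                              # correct, no tool use
    Trajectory(False, [(0.0, 0.0, 100.0, 100.0)], [target]),     # incorrect, excluded
]
rate = faithful_rate_given_correct(trajs)
```

Conditioning on correct answers is what exposes the gap the abstract describes: a model can score well on accuracy while only a small share of its correct answers are backed by a crop that contains the evidence.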

Paper Structure

This paper contains 44 sections, 6 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: An example of a visual agentic system generating an "unfaithful" trajectory: the cropping tool is applied to the wrong region and the analysis is unfaithful, yet the model still reaches the correct answer.
  • Figure 2: Faithfulness conditioned on correct answers on the V* benchmark [wu2024v]. For this visual search problem, cropping is treated as the most effective tool use [zhao2025pyvision; zheng2025deepeyes; su2025pixelreasoner]. We therefore define a tool use as faithful when the cropped image captures any target object, and measure how many correct answers are also faithful (shown in violet). For the remaining correct answers, the tool uses do not capture the target object and are treated as unfaithful (shown in mint). Recent visual agents achieve high final-answer accuracy but fail to use tools faithfully. CodeV substantially improves faithfulness with no decrease in accuracy.
  • Figure 3: Overview of the CodeV rollout and Tool-Aware Policy Optimization (TAPO). The model processes an image $I$ and question $Q$, using tools such as cropping to generate intermediate results for its final answer. Tool faithfulness is scored by a reward model. For a tool such as cropping, the reward model assigns $r^\mathrm{tool}$ based on the observability of the target object in the cropped image. The correctness of the final answer serves as the outcome reward. The policy VLM is fine-tuned with tool-aware policy optimization, a GRPO-style reinforcement learning approach. The policy VLM conducts multiple rollouts with tool use for the same $Q$ and $I$. These rollouts are scored by a hybrid reward system that combines faithfulness and correctness. The final reward is normalized within the group and used to estimate the relative advantage for the policy update.
  • Figure 4: Performance across primitive perception, visual search, reasoning, and math benchmarks. All models other than GPT-4o are 7B models with tool use properly set up.
  • Figure 5: Faithfulness comparison on the V* and HRBench-4k benchmarks. The extremely low faithful tool-use rate of [zhang2025thyme] results from a low tool-use rate and decorative tool use in its chain of thought.
  • ...and 5 more figures
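
The reward flow in the Figure 3 caption (per-rollout hybrid reward, then normalization within the rollout group) can be sketched as follows. This is a minimal illustration, not the paper's formula: the additive combination, the `tool_weight` value, the averaging of per-step tool scores, and the function names are all assumptions.

```python
import statistics


def hybrid_reward(answer_correct: bool, tool_scores: list[float],
                  tool_weight: float = 0.5) -> float:
    """Combine an outcome reward with a process reward (assumed additive form).

    `tool_scores` stands in for per-step r^tool values from a reward model;
    rollouts without tool calls get no process reward.
    """
    outcome = 1.0 if answer_correct else 0.0
    process = sum(tool_scores) / len(tool_scores) if tool_scores else 0.0
    return outcome + tool_weight * process


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style relative advantage: standardize rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for the same (I, Q) pair.
rewards = [
    hybrid_reward(True, [1.0, 1.0]),  # correct answer, faithful crops
    hybrid_reward(True, [0.0]),       # correct answer, crop missed the target
    hybrid_reward(False, []),         # wrong answer, no tool use
    hybrid_reward(False, [1.0]),      # wrong answer, faithful crop
]
advantages = group_advantages(rewards)
```

Under these assumptions, a correct answer backed by faithful tool use receives a larger advantage than an equally correct answer reached through an irrelevant crop, which is the behavior the caption attributes to the hybrid reward.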