Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao

TL;DR

Agent0-VL introduces a self-evolving vision-language agent that unifies reasoning, verification, and self-repair through a Solver-Verifier architecture and a Self-Evolving Reasoning Cycle (SERC). By grounding both reasoning and evaluation in external tools and training in a zero-external-reward loop, it improves continually via reinforcement learning (GRPO) guided by tool-grounded feedback. Empirically, it delivers substantial gains over open-source baselines across math and vision-heavy benchmarks, and it also improves other LVLMs when used as a process reward model. The approach demonstrates that integrated tool usage and structured self-evaluation can yield stable, multi-iteration performance growth in multimodal reasoning tasks.

Abstract

Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0.
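
To make the "self-rewards feeding reinforcement learning" idea concrete, the sketch below shows the group-relative advantage normalization that GRPO-style training is generally built on: each sampled trajectory's reward is compared with the mean and standard deviation of its own sampling group. This is an illustrative sketch, not the authors' implementation. The function names, the use of the population standard deviation, and the choice to average step-wise rewards into one trajectory reward are all assumptions that this summary does not confirm.

```python
from statistics import mean, pstdev


def trajectory_reward(step_rewards):
    """Collapse step-wise self-rewards into one scalar.

    Assumption: a plain average; the paper's actual aggregation is not
    described in this summary.
    """
    return mean(step_rewards)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: one scalar per trajectory sampled for the same prompt.
    eps: small constant (hypothetical value) that avoids division by zero
         when every trajectory in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical trajectories for one prompt, each with step-wise rewards.
group = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
advantages = group_relative_advantages([trajectory_reward(s) for s in group])
```

Trajectories scored above their group's mean receive positive advantages and are reinforced; those below receive negative ones. Because the baseline comes from the group itself, no separate value network or external reward model is needed for this step.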

Paper Structure

This paper contains 23 sections, 9 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: The evolution loop of Agent0-VL and its performance comparison. The left part illustrates the iterative evolution between the Solver and Verifier, where the Solver progressively refines reasoning strategies under Verifier feedback. The right part presents results showing that Agent0-VL outperforms tool-integrated reasoning methods across multiple representative benchmarks. TIR: Tool-Integrated Reasoning.
  • Figure 2: The Framework of Agent0-VL. The unified policy $\pi_\theta$ alternates between two internal roles: the Solver that generates reasoning trajectories with tool calls, and the Verifier that performs generative verification using tool feedback to produce critiques and step-wise rewards. These roles are jointly optimized through the Self-Evolving Reasoning Cycle, where self-generated rewards guide policy updates via RL.
  • Figure 3: The overall Best-of-8 evaluation results across seven multimodal reasoning benchmarks with different critic models. Our model greatly enhances the overall performance compared with the Qwen2.5-VL-7B model.
  • Figure 4: Simplified illustration of Agent0-VL’s self-evolving reasoning process on a geometric reasoning task during the training phase. The model first produces an incorrect answer (Phase 1), after which the Verifier identifies the logical error (Phase 2), triggers Self-Repair to generate a correction (Phase 3), and finally re-executes reasoning via the Solver to reach the correct solution (Phase 4). The complete multi-phase case is provided in the paper's appendix.
  • Figure 5: System prompt for the Solver.
  • ...and 7 more figures
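
The Best-of-8 setting in Figure 3 can be read as Best-of-N selection with a critic: sample N candidate solutions, score each candidate's reasoning steps with the critic, and keep the highest-scoring one. The sketch below is a minimal illustration under stated assumptions. `select_best_of_n`, `toy_critic`, the candidate dictionary layout, and the use of a mean over step scores are all hypothetical; they are not taken from the paper or its code.

```python
def select_best_of_n(candidates, score_steps):
    """Return the candidate whose steps get the highest mean critic score.

    candidates: list of dicts with "answer" and "steps" keys (assumed layout).
    score_steps: callable mapping a list of steps to a list of floats,
                 standing in for a process reward model.
    """
    def mean_score(candidate):
        scores = score_steps(candidate["steps"])
        # A candidate with no steps cannot be verified, so it ranks last.
        return sum(scores) / len(scores) if scores else float("-inf")

    return max(candidates, key=mean_score)


def toy_critic(steps):
    # Hypothetical stand-in for a Verifier: rewards steps marked as tool-checked.
    return [1.0 if "[checked]" in step else 0.0 for step in steps]


candidates = [
    {"answer": "10", "steps": ["guess the side length", "report 10"]},
    {"answer": "12", "steps": ["measure with tool [checked]", "compute area [checked]"]},
]
best = select_best_of_n(candidates, toy_critic)
```

In a real evaluation the toy critic would be replaced by a model that critiques each step, and N would be 8 to match the figure. Averaging step scores is only one possible aggregation; taking the minimum step score is a common alternative that penalizes a single bad step more heavily.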