No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

Damiano Marsili, Georgia Gkioxari

TL;DR

VALOR introduces an annotation-free framework for visual reasoning that jointly tunes reasoning and grounding using multimodal verifiers: an LLM verifier refines plan-based reasoning through reinforcement learning, and a VLM verifier strengthens visual grounding via hard-negative mining and pseudo-label generation. By leveraging unlabeled image–query pairs and a specialized reward model, VALOR achieves substantial improvements over open-source LLM baselines, RL-tuned VLMs, and program-synthesis methods across diverse spatial benchmarks. The results demonstrate scalable, data-efficient learning for complex 3D spatial reasoning and grounding, with grounding improvements further enhancing performance on grounding-sensitive tasks. The work highlights the value of verifiers as reliable feedback sources to guide annotation-free training in multimodal reasoning systems.

Abstract

Visual reasoning is challenging, requiring both precise object grounding and an understanding of complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches, which use pre-trained models and avoid training but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground-truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks and show that our method improves visual reasoning and surpasses open-source and proprietary models; with our improved visual grounding model, we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/
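
To make the grounding side of the abstract concrete, the sketch below shows one way a verifier could split a detector's over-predictions into accepted pseudo-labels and rejected hard negatives. It is an illustration under stated assumptions, not the paper's pipeline: `Detection`, `split_detections`, and `toy_verifier` are hypothetical names, and the area threshold merely stands in for a real VLM critic.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Detection:
    """One candidate box proposed by a grounding model (hypothetical type)."""
    label: str
    box: Box


def split_detections(
    detections: List[Detection],
    verifier: Callable[[Detection], bool],
) -> Tuple[List[Detection], List[Detection]]:
    """Partition candidates by the verifier's accept/reject decision.

    Accepted boxes can serve as pseudo-labels; rejected boxes can serve
    as hard negatives when fine-tuning the grounding model.
    """
    pseudo_labels: List[Detection] = []
    hard_negatives: List[Detection] = []
    for det in detections:
        if verifier(det):
            pseudo_labels.append(det)
        else:
            hard_negatives.append(det)
    return pseudo_labels, hard_negatives


def toy_verifier(det: Detection) -> bool:
    """Stand-in for a VLM critic (assumption): accept boxes of area >= 100."""
    x0, y0, x1, y1 = det.box
    return (x1 - x0) * (y1 - y0) >= 100.0


candidates = [
    Detection("chair", (0.0, 0.0, 20.0, 20.0)),
    Detection("chair", (0.0, 0.0, 5.0, 5.0)),
]
kept, rejected = split_detections(candidates, toy_verifier)
```

In practice the `verifier` argument would wrap a call to a VLM that inspects the image crop; keeping it a plain callable makes the partition logic testable without any model.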

Paper Structure

This paper contains 32 sections, 4 equations, 21 figures, 11 tables.

Figures (21)

  • Figure 1: Visual reasoning relies on accurate reasoning and visual grounding. To tackle the task, we propose an annotation-free training paradigm, called VALOR, that learns to decompose the task and invoke tools by leveraging multimodal verifiers, without the need for ground-truth supervision.
  • Figure 2: Method overview. Our method, VALOR, tackles visual reasoning across a broad range of spatial tasks, in 2D and 3D. During training, LLM verifiers are used to improve reasoning via RL while VLM verifiers serve as critics to tune vision grounding models via SFT.
  • Figure 3: (a) LLM verifiers reward semantic correctness by evaluating logical correctness, consideration of object attributes and spatial relationships, and whether the code adheres to the predicted plan. Python interpreters check format adherence and syntax. (b) VLM verifiers refine visual grounding over-predictions through three stages, generating pseudo-labels from spatial reasoning queries.
  • Figure 4: Outputs for VALOR, LLMs with tools, and RL-tuned VLMs. For each example, we show the image, query, and model output. We recommend zooming in to read the outputs.
  • Figure 5: Visual grounding in VALOR, GRIT and ViGoRL.
  • ...and 16 more figures
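
The reward described in the Figure 3(a) caption can be sketched roughly as follows. This is a minimal illustration, not the authors' reward: the function names, the hard gate on syntax, and the equal weighting of verifier verdicts are all assumptions, and the verdict dictionary stands in for the outputs of an LLM verifier.

```python
import ast
from typing import Dict


def syntax_ok(program: str) -> bool:
    """Return True if the generated program parses as valid Python."""
    try:
        ast.parse(program)
        return True
    except (SyntaxError, ValueError):
        return False


def plan_reward(program: str, verdicts: Dict[str, bool]) -> float:
    """Combine an interpreter-side syntax check with verifier verdicts.

    Assumptions (not from the paper): a program that fails to parse gets
    zero reward, and each verifier criterion contributes equally.
    """
    if not syntax_ok(program) or not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)


# Hypothetical verdicts mirroring the criteria named in the caption.
verdicts = {
    "logic": True,
    "attributes": True,
    "spatial_relationships": False,
    "plan_adherence": True,
}
good = plan_reward("x = 1", verdicts)
bad = plan_reward("x = = 1", verdicts)
```

Gating on the syntax check first means the verifier's semantic judgments only count for programs that could actually run, which is one plausible way to combine the two signals.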