One RL to See Them All: Visual Triple Unified Reinforcement Learning

Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan

TL;DR

V-Triune introduces a unified reinforcement learning framework for post-training vision-language models that jointly optimizes visual reasoning and perception. It decomposes the system into sample-level data formatting, verifier-level reward computation, and source-level metric monitoring, augmented by a Dynamic IoU reward to stabilize perception tasks. The Orsta family (7B–32B) demonstrates consistent gains on MEGA-Bench Core and strong downstream performance, validating scalable, multi-task RL for VLMs. The work provides practical engineering strategies and a public codebase to accelerate adoption of unified RL in vision-language understanding.

Abstract

Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to +14.1 across its various 7B and 32B model variants, and the performance benefits extend to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
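
To make the idea of a Dynamic IoU reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes the reward is the raw IoU when it clears a threshold and zero otherwise, and that the threshold tightens as training progresses. The function names, the `SCHEDULE` values, and the `(x1, y1, x2, y2)` box layout are all illustrative assumptions that do not come from this summary.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (fraction of training completed, IoU threshold).
# The values are placeholders for illustration only.
SCHEDULE = ((0.10, 0.85), (0.25, 0.95), (1.00, 0.99))


def dynamic_threshold(step, total_steps, schedule=SCHEDULE):
    """Return the IoU threshold that applies at the given training step."""
    progress = step / total_steps
    for upper_bound, threshold in schedule:
        if progress <= upper_bound:
            return threshold
    return schedule[-1][1]


def dynamic_iou_reward(pred_box, gt_box, step, total_steps):
    """Assumed reward rule: IoU if it meets the current threshold, else 0."""
    score = iou(pred_box, gt_box)
    return score if score >= dynamic_threshold(step, total_steps) else 0.0
```

Under this sketch, a prediction with IoU 0.9 would be rewarded early in training and receive zero reward later, once the threshold has risen above it. That is one way to read "adaptive" and "progressive" feedback.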

Paper Structure

This paper contains 28 sections, 8 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Performance of Orsta on MEGA-Bench Tasks. V-Triune is evaluated across visual reasoning and visual perception tasks—Math, Science, Charting, Puzzle, Detection, Grounding, Counting, and OCR, demonstrating notable performance gains of Orsta over the backbone: +3.2%, +14.1%, and +2.1% in different model variants.
  • Figure 2: V-Triune System. It integrates three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (for custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose data-source level problems). Additionally, a novel Dynamic IoU reward offers adaptive, progressive feedback for perception tasks.
  • Figure 3: Sample-level Data Scheme for Unified Training. This format, implemented using HuggingFace datasets, allows fine-grained control over reward computation by defining reward_model (including reward types, weights like accuracy/format_ratio) and verifier specifications at the individual sample level. This enables flexible and scalable handling of diverse multimodal tasks.
  • Figure 4: Architecture of the Asynchronous Reward Server. The RL trainer interacts with a remote server via client-server proxies, where specialized verifiers (e.g., MathVerify, Detection) compute rewards using task-specific logic and dynamic thresholds (e.g., dynamic IoU threshold).
  • Figure 5: COCO Test Set Performance with Various Reward Designs. (a) Comparison between IoU-based and mAP-based rewards on a selected COCO multi-object subset; (b) Comparison between vanilla IoU reward and rule-based IoU reward on a selected COCO single-object subset; (c, d) Comparison between rule-based IoU reward and dynamic IoU reward on the COCO multi-object subset and the OVDEval negation subset.
  • ...and 10 more figures
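
The captions for Figures 2 to 4 describe samples that carry their own reward specification, verifiers that score responses, and monitoring at the data-source level. The sketch below shows how those three pieces could fit together, under stated assumptions. The field names `accuracy_ratio`, `verifier`, `ground_truth`, and `data_source`, the `<answer>` tag convention, and the `exact_match` verifier are hypothetical; only `reward_model` and `format_ratio` are named in the captions.

```python
import re
from collections import defaultdict


def exact_match_verifier(response, ground_truth):
    """Toy verifier returning (accuracy, format) scores in [0, 1]."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0, 0.0  # no parsable answer: both scores are zero
    correct = match.group(1).strip() == str(ground_truth)
    return float(correct), 1.0


# Hypothetical registry mapping a sample's verifier name to a function.
VERIFIERS = {"exact_match": exact_match_verifier}


def compute_reward(sample, response):
    """Weighted sum of accuracy and format scores, as set by the sample."""
    spec = sample["reward_model"]
    accuracy, fmt = VERIFIERS[spec["verifier"]](response, spec["ground_truth"])
    return spec["accuracy_ratio"] * accuracy + spec["format_ratio"] * fmt


def mean_reward_by_source(samples, responses):
    """Average reward per data source, for source-level monitoring."""
    totals, counts = defaultdict(float), defaultdict(int)
    for sample, response in zip(samples, responses):
        totals[sample["data_source"]] += compute_reward(sample, response)
        counts[sample["data_source"]] += 1
    return {source: totals[source] / counts[source] for source in totals}


# One sample in the assumed per-sample format.
counting_sample = {
    "data_source": "example_counting_set",  # hypothetical source name
    "prompt": "How many apples are in the image?",
    "reward_model": {
        "verifier": "exact_match",
        "ground_truth": "3",
        "accuracy_ratio": 0.9,
        "format_ratio": 0.1,
    },
}
```

Because the weights and verifier name are stored in each sample, a new task can be added by registering a verifier and tagging its samples, without changing the training loop.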