Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning

Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, Lin Ma

TL;DR

Metis-RISE presents a hybrid training paradigm for multimodal reasoning that first uses RL incentivization (via Group Relative Policy Optimization) to awaken latent reasoning capabilities, then applies supervised fine-tuning with Self-Distilled Trajectories and Expert-Augmented knowledge to address sampling inefficiency and capability gaps. The approach trains 7B and 72B parameter models, achieving state-of-the-art performance among similar-sized OpenCompass leaderboard entries and competitive results with larger proprietary systems. Ablation studies confirm that RL provides a strong early boost while SFT consolidates and extends reasoning, with mixed-modal data yielding the best gains. The work suggests a scalable, two-stage recipe for advancing multimodal reasoning in LLM-based systems and outlines directions for iterative training and model-based verification.

Abstract

Recent advances in large language models (LLMs) have brought a surge of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods that rely solely on reinforcement learning (RL) can struggle with sample inefficiency and cannot activate reasoning capabilities that are entirely absent, while conventional pipelines that begin with a cold-start supervised fine-tuning (SFT) phase before RL may restrict the model's exploratory capacity and converge suboptimally. In this work, we introduce Metis-RISE (RL Incentivizes and SFT Enhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE omits the initial SFT stage and begins instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. The subsequent, targeted SFT stage addresses two key challenges identified during RL: (1) inefficient trajectory sampling on tasks where the model possesses the correct reasoning but applies it inconsistently, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) fundamental capability absence, which we address by injecting expert-augmented knowledge for prompts on which the model fails entirely. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE and yields two versions of our MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard demonstrate that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall. Please refer to our project page for open-source information.
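The split between the two SFT data sources can be illustrated with a minimal sketch. It assumes that each prompt is sampled k times from the RL-stage model and then routed by how many samples are correct: prompts that are solved only sometimes contribute their correct trajectories as self-distilled data, and prompts that are never solved are set aside for expert-augmented answers. The names `route_prompts`, `sample_fn`, `is_correct`, and the value of `k` are hypothetical and chosen for illustration; they are not taken from the paper, and setting aside prompts that are solved on every sample is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RoutedPrompts:
    # prompt -> correct trajectories sampled from the RL-stage model
    self_distill: Dict[str, List[str]]
    # prompts with no correct sample (candidates for expert-augmented answers)
    expert_augment: List[str]
    # prompts solved on every sample (assumption: these need no extra SFT data)
    already_solved: List[str]


def route_prompts(
    prompts: List[str],
    sample_fn: Callable[[str, int], List[str]],  # hypothetical: returns k trajectories
    is_correct: Callable[[str, str], bool],      # hypothetical: answer verifier
    k: int = 8,                                  # illustrative sample count
) -> RoutedPrompts:
    """Route each prompt by the correctness of its k sampled trajectories."""
    routed = RoutedPrompts(self_distill={}, expert_augment=[], already_solved=[])
    for prompt in prompts:
        trajectories = sample_fn(prompt, k)
        correct = [t for t in trajectories if is_correct(prompt, t)]
        if not correct:
            # The model fails entirely: capability absence.
            routed.expert_augment.append(prompt)
        elif len(correct) == len(trajectories):
            # Consistently solved: nothing to distill under this sketch's assumption.
            routed.already_solved.append(prompt)
        else:
            # Inconsistently solved: keep only the correct trajectories.
            routed.self_distill[prompt] = correct
    return routed


if __name__ == "__main__":
    # Toy stand-ins for a model sampler and a verifier.
    canned = {
        "mixed": ["4", "5", "4"],
        "failed": ["1", "2", "3"],
        "solved": ["4", "4", "4"],
    }
    result = route_prompts(
        list(canned),
        sample_fn=lambda prompt, k: canned[prompt][:k],
        is_correct=lambda prompt, trajectory: trajectory == "4",
        k=3,
    )
    print(result)
```

In a real pipeline the verifier and the sampler would be the expensive parts; the routing itself is only a per-prompt count of correct samples.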

Paper Structure

This paper contains 17 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: We benchmark the proposed Metis-RISE on the OpenCompass Multimodal Reasoning Leaderboard, comparing it with other state-of-the-art methods.
  • Figure 2: Overview of the Metis-RISE framework. RL first incentivizes exploration and activates latent reasoning. Subsequent SFT stages enhance these abilities by addressing inefficient trajectory sampling (via self-distillation) and fundamental capability absence (via expert-augmented knowledge injection).
  • Figure 3: Training dynamics of Accuracy Reward and Response Length during the initial RL phase for Metis-RISE-72B. Subfigure (a) shows the progression of accuracy reward, and subfigure (b) illustrates the change in average response length as training proceeds.
  • Figure 4: Example of a multi-step analytic geometry problem solved by Metis-RISE-72B.
  • ...and 5 more figures