OneThinker: All-in-one Reasoning Model for Image and Video

Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue

TL;DR

OneThinker is a unified, all-in-one multimodal reasoning model that handles image and video tasks spanning QA, captioning, grounding, tracking, and segmentation. The authors build the OneThinker-600k training corpus, with the CoT-annotated subset OneThinker-SFT-340k used for the SFT cold start, and introduce EMA-GRPO to balance heterogeneous rewards in multi-task RL. The model performs strongly across 31 benchmarks and shows knowledge transfer between certain tasks as well as preliminary zero-shot generalization to unseen tasks, a step toward a multimodal reasoning generalist.

Abstract

Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
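
The abstract describes EMA-GRPO only at a high level: it tracks task-wise moving averages of reward standard deviations so that tasks with very different reward scales are optimized in a balanced way. The sketch below is one plausible reading of that sentence, not the paper's actual formulation. The class name `TaskEmaStd`, the `decay` and `eps` parameters, and the choice to center rewards on the rollout-group mean are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


class TaskEmaStd:
    """Hypothetical helper: keeps one EMA of reward std per task."""

    def __init__(self, decay=0.99, eps=1e-6):
        self.decay = decay      # assumed smoothing factor, not from the paper
        self.eps = eps          # guards against division by zero
        self.ema_std = {}       # task name -> smoothed reward std

    def update(self, task, rewards):
        """Fold the std of one rollout group into the task's moving average."""
        batch_std = pstdev(rewards)
        prev = self.ema_std.get(task)
        if prev is None:
            # First observation for this task: start from the batch value.
            new = batch_std
        else:
            new = self.decay * prev + (1.0 - self.decay) * batch_std
        self.ema_std[task] = new
        return new

    def advantages(self, task, rewards):
        """Center rewards on the group mean, scale by the task-level EMA std."""
        scale = self.update(task, rewards)
        mean = fmean(rewards)
        return [(r - mean) / (scale + self.eps) for r in rewards]


if __name__ == "__main__":
    tracker = TaskEmaStd(decay=0.5)
    # A binary-reward task (e.g. QA accuracy) and a dense-reward task (e.g. IoU).
    print(tracker.advantages("qa", [0.0, 1.0, 0.0, 1.0]))
    print(tracker.advantages("grounding", [0.42, 0.47, 0.40, 0.45]))
```

Under this reading, each task's advantages are scaled by that task's own smoothed reward spread rather than by a single noisy per-group statistic, which is one way the balancing described in the abstract could be realized.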

Paper Structure

This paper contains 30 sections, 12 equations, 19 figures, 10 tables.

Figures (19)

  • Figure 1: Overview of our OneThinker, which is capable of thinking across a wide range of tasks for image and video understanding.
  • Figure 2: Performance gains of our model over Qwen3-VL-Instruct-8B across diverse visual tasks after training.
  • Figure 3: Overview of our curated training dataset, including both image and video modalities for a diverse range of understanding tasks.
  • Figure 4: Comparison of advantage formulations in three RL algorithms.
  • Figure 5: Performance on unseen visual tasks.
  • ...and 14 more figures