Table of Contents
Fetching ...

Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu

Abstract

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.

Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Abstract

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
Paper Structure (19 sections, 10 equations, 5 figures, 6 tables)

This paper contains 19 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration and Performance of Insight-V and Insight-V++. Insight-V employs a dual-agent architecture, comprising dedicated reasoning and summarization modules to achieve significant performance gains across various image reasoning benchmarks. Building upon this foundation, Insight-V++ introduces a novel self-evolving training paradigm. By utilizing an iterative loop of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to continuously refine its visual reasoning capabilities, Insight-V++ yields superior results on both image and video benchmarks.
  • Figure 2: Data Generation Pipeline of Insight-V Series. The reasoning processes are generated progressively through a reasoning generator, and then fed into a multi-granularity assessment system to ensure high-quality reasoning.
  • Figure 3: Overview of Insight-V and Insight-V++ Model Design. In Insight-V, we derive a multi-agent system from a single base model. By decomposing the task into reasoning and summarization, these two agents collaborate to enhance overall performance. For Insight-V++, we further introduce ST-GRPO and J-GRPO, which are tailored specifically to each agent. Coupled with a self-evolving strategy, this training loop consisting of SFT and customized RL drives the multi-agent system to excel at visual reasoning.
  • Figure 4: Ablations on the amount of training data. The reasoning agent benefits from data scaling, providing more valuable insights for the summary agent.
  • Figure 5: Qualitative Results of Insight-V Series. We present qualitative comparisons of Insight-V with Chain-of-Thought and learning Insight-V with direct SFT (Vanilla). For the Insight-V system, the reasoning agent delivers a more coherent and structured reasoning process, guiding the summary agent toward the correct answer, whereas other methods struggle with complex reasoning tasks and fail to solve such challenging problems.