From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye

TL;DR

This work proposes Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement.

Abstract

As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision-making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-check massive amounts of unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating that DPE is a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.
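
To make the diagnose, re-mix, reinforce, re-diagnose loop concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the function names `diagnose_mixture` and `run_spiral`, the `floor` and `gain` parameters, the category names, and the proportional-to-error weighting rule are all assumptions chosen for illustration, and the "training" step is a numeric stand-in rather than real data generation or RL.

```python
from typing import Dict, List, Tuple


def diagnose_mixture(error_rates: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Map per-category error rates to data-mixture ratios that sum to 1.

    Assumption: weaker categories (higher error) receive a larger share.
    The floor keeps every category represented, even one with zero error.
    """
    if not error_rates:
        raise ValueError("error_rates must not be empty")
    raw = {cat: max(rate, floor) for cat, rate in error_rates.items()}
    total = sum(raw.values())
    return {cat: weight / total for cat, weight in raw.items()}


def run_spiral(
    error_rates: Dict[str, float], iterations: int = 3, gain: float = 0.5
) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Repeat diagnosis and a toy 'targeted reinforcement' step.

    Returns the mixture chosen at each iteration and the final error rates.
    """
    rates = dict(error_rates)
    history: List[Dict[str, float]] = []
    for _ in range(iterations):
        # Diagnosis: decide how much of the next data batch each category gets.
        mixture = diagnose_mixture(rates)
        history.append(mixture)
        # Toy stand-in for data generation plus RL: a category's error shrinks
        # in proportion to the share of training data it received.
        rates = {cat: rate * (1.0 - gain * mixture[cat]) for cat, rate in rates.items()}
    return history, rates


if __name__ == "__main__":
    # Hypothetical starting error rates for three illustrative categories.
    start = {"math": 0.6, "ocr": 0.3, "hallucination": 0.1}
    mixtures, final_rates = run_spiral(start)
    for step, mixture in enumerate(mixtures, start=1):
        print(step, {cat: round(share, 3) for cat, share in mixture.items()})
    print("final", {cat: round(rate, 3) for cat, rate in final_rates.items()})
```

The point of the sketch is only the control flow: the mixture is recomputed from a fresh diagnosis at every iteration instead of being fixed up front.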

Paper Structure

This paper contains 38 sections, 21 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Due to the lack of interpretable diagnostics and scarcity of visual diversity, previous self-evolution frameworks can alleviate hallucination to some extent but fail to provide meaningful improvements on long-tail tasks such as mathematics and OCR. As a result, the model often exhibits instability or even degradation in these capabilities during the evolution process. In contrast, our DPE framework effectively addresses these blind spots and supports a more comprehensive and balanced progression of the model’s abilities.
  • Figure 2: Overview of the DPE framework.
  • Figure 3: Ablation results on CharXiv and MathVision across three iterations, comparing full DPE with variants.
  • Figure 4: Category distribution of the seed set and the diagnosis-guided mixture ratios recommended by DPE over three iterations.
  • Figure 5: UMAP visualization of image diversity (left) and text diversity (right) for VisPlay and DPE.
  • ...and 1 more figure