EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

TL;DR

EvoLMM tackles the problem of unsupervised enhancement of visual reasoning in large multimodal models by decomposing a base backbone into a Proposer that creates image-grounded questions and a Solver that answers them, all trained through a continuous internal self-consistency reward without any human labels or external evaluators. The approach replaces brittle discrete rewards with smooth, gradient-friendly signals that scale with answer agreement and question difficulty, enabling a self-curriculum where progressively more challenging but solvable queries emerge. Empirically, EvoLMM yields consistent +2–3% gains across multiple multimodal math and diagram reasoning benchmarks and demonstrates robustness across backbones and model sizes, while maintaining data- and annotation-free training. This work advances autonomous, scalable multimodal learning and lays groundwork for open-ended self-improvement of reasoning capabilities in vision-language systems.

Abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline, easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.

Paper Structure

This paper contains 15 sections, 11 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of our fully unsupervised self-evolving LMM framework (EvoLMM). Our EvoLMM enables a base LMM to improve its reasoning ability without any human labels, metadata, or external reward models. Given only raw images, a proposer first generates visually grounded questions, and a solver attempts to answer them multiple times. The degree of agreement among solver responses produces a continuous self-consistency reward, forming a closed-loop training signal that drives both modules to co-evolve.
  • Figure 2: Overview of our Proposer–Solver based self-evolving framework (EvoLMM). Given only a raw visual input (e.g., a multimodal chart), the Proposer module generates a question $q$ about the image content. The Solver then produces multiple answer samples $y_{1:N}$, forming an empirical answer distribution $p(a\,|\,x,q)$. The Solver reward $r^{\text{sol}}$ is a continuous, self-supervised signal based on the likelihood of each answer sample, modulated by a length penalty that constrains the Solver's response format. The Proposer reward $r^{\text{prop}}$ is an entropy-based band-pass function that encourages moderate-difficulty questions for which the Solver is not yet completely correct and certain. By rewarding this moderate-entropy window, the Proposer gradually learns to generate questions that are challenging enough to stimulate reasoning while remaining solvable, forming an automatic curriculum without external supervision. Both modules are optimized with standard REINFORCE objectives regularized by token-level KL constraints to reference policies. This closed-loop training jointly refines question generation and reasoning using only images, without any annotated Q&A pairs, discrete rewards, or external verifiers.
  • Figure 3: Comparison of Proposer and Solver rewards under discrete and our continuous formulations for a single iteration. Each panel shows rewards for low-entropy (high-agreement) and high-entropy (diverse-answer) cases with $N{=}5$ Solver responses. For the Proposer (left), discrete rewards collapse these cases into identical plateaus, providing weaker learning signals. In contrast, our continuous reward varies smoothly and distinguishes Solver response patterns. For the Solver (right), the discrete reward increases linearly only with the majority count and does not reflect partial progress, leading to sparse and unstable learning signals during early training. Instead, our continuous Solver reward scales smoothly with agreement, enabling the Solver to improve reasoning consistency gradually. Collectively, our continuous reward formulation creates a more stable self-evolution loop, where the Proposer and Solver co-adapt toward more grounded and consistent reasoning behavior.
  • Figure 4: Comparison of the discrete and our continuous reward progression during training. Top: The discrete majority-vote reward (red) remains low and unstable during training, providing a weak learning signal due to Solver output variability in early stages. In contrast, our continuous self-consistency reward (green) produces stable and valuable feedback, enabling the Proposer to consistently generate moderate-difficulty, informative questions. Bottom: Guided by our Proposer with continuous reward, we observe the Solver (green) to be more consistent and stable compared to its discrete (red) counterpart.
  • Figure 5: Example showing the progression of questions generated by our continuous-reward Proposer, along with their rewards. From top (step 102) to bottom (step 5122), the Proposer increases question complexity, which in turn enhances the reasoning capabilities of the Solver. Refer to the supplementary material for more examples.
  • ...and 1 more figure
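
The reward shapes described in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the captions give no formulas, so the functional forms here (a power of the empirical answer probability for the Solver, a linear length penalty, and a Gaussian bump over entropy as the band-pass for the Proposer) are assumptions, as are all function names and the hyperparameters `gamma`, `max_len`, `len_weight`, `mu`, and `sigma`. A discrete majority-vote baseline is included only for contrast.

```python
import math
from collections import Counter


def answer_distribution(answers):
    """Empirical distribution p(a | x, q) over the N sampled Solver answers."""
    counts = Counter(answers)
    n = len(answers)
    return {a: c / n for a, c in counts.items()}


def entropy(dist):
    """Shannon entropy (in nats) of an answer distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def solver_rewards(answers, lengths, gamma=0.7, max_len=64, len_weight=0.1):
    """Continuous per-sample Solver reward (assumed form).

    Each sample is scored by the empirical probability of its answer raised
    to `gamma`, then scaled down if the response exceeds `max_len` tokens.
    All three hyperparameters are hypothetical values for illustration.
    """
    dist = answer_distribution(answers)
    rewards = []
    for answer, n_tokens in zip(answers, lengths):
        overflow = max(0, n_tokens - max_len) / max_len
        penalty = max(0.0, 1.0 - len_weight * overflow)
        rewards.append(dist[answer] ** gamma * penalty)
    return rewards


def proposer_reward(answers, mu=0.9, sigma=0.35):
    """Entropy band-pass Proposer reward (assumed Gaussian form).

    The reward peaks when the answer entropy is near `mu` and decays for
    both very low entropy (trivial question) and very high entropy
    (answers fully scattered). `mu` and `sigma` are hypothetical.
    """
    h = entropy(answer_distribution(answers))
    return math.exp(-((h - mu) ** 2) / (2 * sigma ** 2))


def discrete_majority_reward(answers):
    """Baseline for contrast: 1 if a sample matches the majority answer, else 0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


if __name__ == "__main__":
    # N = 5 Solver samples for one proposed question (made-up answers).
    samples = ["12", "12", "12", "15", "9"]
    token_lengths = [10, 12, 11, 14, 9]
    print("solver (continuous):", solver_rewards(samples, token_lengths))
    print("solver (discrete):  ", discrete_majority_reward(samples))
    print("proposer:           ", proposer_reward(samples))
```

Under these assumed defaults, a 3-1-1 split of answers earns the Proposer a higher reward than either a unanimous set or five distinct answers, which is the band-pass behavior the Figure 2 caption describes. The continuous Solver reward also assigns nonzero credit to minority answers, whereas the majority-vote baseline assigns them none.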