EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Omkar Thawakar; Shravan Venkatraman; Ritesh Thawkar; Abdelrahman Shaker; Hisham Cholakkal; Rao Muhammad Anwer; Salman Khan; Fahad Khan

TL;DR

EvoLMM targets the unsupervised improvement of visual reasoning in large multimodal models. A single backbone is instantiated in two roles: a Proposer that generates image-grounded questions and a Solver that answers them, both trained with a continuous internal self-consistency reward and without human labels or external evaluators. The approach replaces brittle discrete rewards with smooth, gradient-friendly signals that scale with answer agreement and question difficulty, so that a self-curriculum emerges in which questions become progressively harder while remaining solvable. Empirically, EvoLMM yields consistent gains of about 2–3% across multiple multimodal math and diagram reasoning benchmarks and is robust across backbones and model sizes, while training only on raw images without annotations. The work is a step toward autonomous, scalable multimodal learning and open-ended self-improvement of reasoning in vision-language systems.

Abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which answers them. Learning proceeds through a continuous self-rewarding process based on the internal consistency of the Solver's answers. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline that eases future research on self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
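
To make the idea of a continuous self-consistency reward concrete, the following is a minimal sketch, not the reward defined in the paper. It assumes the Solver is sampled several times on one Proposer question and that answers are compared as exact strings. The function names (`discrete_reward`, `solver_reward`, `proposer_reward`) and the parameters `target` and `width` are hypothetical and chosen only for illustration. The Solver signal grows smoothly with answer agreement, and the Proposer signal peaks at moderate agreement, which is one plausible way to favor questions that are challenging but still solvable.

```python
import math
from collections import Counter


def majority_agreement(answers):
    """Fraction of sampled answers that match the most common answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)


def discrete_reward(answers):
    """Brittle baseline for comparison: 1 only if every sample agrees."""
    return 1.0 if majority_agreement(answers) == 1.0 else 0.0


def solver_reward(answers):
    """Continuous signal (assumed form): scales with answer agreement."""
    return majority_agreement(answers)


def proposer_reward(answers, target=0.75, width=0.25):
    """Assumed band-pass shape: highest when agreement is near `target`.

    Questions that are trivially easy (agreement near 1) or unanswerable
    (agreement near chance) both score lower than moderately hard ones.
    """
    gap = majority_agreement(answers) - target
    return math.exp(-(gap ** 2) / (2 * width ** 2))


if __name__ == "__main__":
    samples = ["4", "4", "4", "5"]  # four hypothetical Solver samples
    print(discrete_reward(samples), solver_reward(samples), proposer_reward(samples))
```

With three of four samples in agreement, the discrete baseline gives no signal at all, while the continuous variant still rewards partial consistency. This is the contrast the abstract draws, under the assumptions stated above.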
