Empowering Small VLMs to Think with Dynamic Memorization and Exploration

Jiazhen Liu, Yuchuan Deng, Long Chen

TL;DR

DyME is a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) at each optimization step, serving as a robust, standalone strategy that stabilizes SVLM learning.

Abstract

Small-scale Vision-Language Models (SVLMs) are exceptionally well-suited for proprietary tasks. Equipping them with thinking capabilities is a critical step to enhance their performance and reliability in these specific domains. However, existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capacity of SVLMs. Consequently, directly applying these paradigms to SVLMs fails to instill the desired thinking abilities. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. Yet the core challenge lies in managing the inherent trade-off: excessive reliance on SFT can force the model to memorize pseudo thinking traces, while over-emphasizing RLVR can lead to unstable exploration (i.e., advantage collapse). To address this, we propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) at each optimization step. By ensuring that every update contributes to the trade-off, DyME serves as a robust, standalone strategy that stabilizes SVLM learning. Complementing this paradigm, we further introduce a synergistic Visual Supervision mechanism (comprising a visual checker and refiner) designed to inject dynamically enhanced, image-grounded guidance during optimization. Extensive experiments across diverse domains demonstrate that DyME consistently achieves this balance, and thus delivers substantial performance improvements on specialized tasks. These results establish DyME as a practical and effective solution for empowering SVLMs with reliable thinking capabilities. GitHub: https://github.com/HKUST-LongGroup/DyME
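To make "advantage collapse" and the per-step choice between memorization and exploration concrete, here is a minimal sketch. It assumes GRPO-style group-normalized advantages. The function names `group_advantages` and `select_mode`, and the switching rule itself, are hypothetical illustrations and not the authors' implementation; the abstract states only that the mode is chosen at each optimization step.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each reward normalized within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_mode(rewards, tol=1e-8):
    """Hypothetical switching rule (an assumption, not taken from the paper).

    If every sampled response receives the same reward, all advantages are
    zero and an RL update carries no learning signal (advantage collapse),
    so this sketch falls back to a supervised (memorization) step.
    """
    if max(rewards) - min(rewards) <= tol:
        return "memorization"
    return "exploration"


# All responses fail: the advantages collapse to zero.
collapsed = group_advantages([0.0, 0.0, 0.0, 0.0])
# Mixed outcomes: the advantages are informative.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Under this assumed rule, a step whose sampled group has identical rewards becomes an SFT update rather than a wasted RL update, which is one way every update could contribute to the trade-off described above.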

Paper Structure

This paper contains 21 sections, 8 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Training paradigms for enabling VLM thinking. The LVLM is Qwen2.5-VL-32B [Qwen2.5-VL] and the SVLM is SmolVLM-500M [marafioti2025smolvlm]. (a) Existing paradigms are effective for LVLMs but unsuitable for SVLMs. (b) The two-stage training paradigm (SFT $\rightarrow$ RL) faces a challenging trade-off. Our proposed DyME dynamically balances this trade-off.
  • Figure 2: Performance of SmolVLM-500M [marafioti2025smolvlm] on ChartQA [masry-etal-2022-chartqa]. Existing paradigms degrade performance, whereas DyME yields improvements.
  • Figure 3: Workflow and module components of DyME. At each training step, DyME dynamically switches between memorization (via SFT) and exploration (via GRPO) modes based on its generations. Visual supervision is introduced through the visual refiner and visual checker. The refiner enhances the targets for memorization by incorporating richer visual elements (green), while the checker rewards generated thinking contexts according to their visual relevance.
  • Figure 4: Training rewards. GRPO and two-stage training suffer from severe advantage collapse.
  • Figure 5: Case studies on chart understanding and geometry solving, using LLaVA-OV-S. The SVLM originally produces hallucinated answers (red), while the DyME-trained model generates structured thinking traces (green) that incorporate grounded values, improving performance.
  • ...and 5 more figures