Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

Yunyao Yu, Zhengxian Wu, Zhuohong Chen, Hangrui Xu, Zirui Liao, Xiangwen Deng, Zhifang Liu, Senyuan Shi, Haoqian Wang

Abstract

In the unsupervised self-evolution of Multimodal Large Language Models (MLLMs), the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting, selecting the most frequent output as the pseudo-golden answer; this frequency may stem from the model's intrinsic biases rather than from the objective correctness of the reasoning paths. To counteract the resulting degradation, we propose \textbf{C}ontinuous \textbf{S}oftened \textbf{R}etracing re\textbf{S}ampling (\textbf{CSRS}) for MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (\textbf{RRM}), in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose the Softened Frequency Reward (\textbf{SFR}), which replaces binary rewards with continuous signals, calibrating the reward according to each answer's frequency across the sampled reasoning sets. Furthermore, combined with Visual Semantic Perturbation (\textbf{VSP}), CSRS ensures the model prioritizes mathematical logic over superficial visual cues. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision, achieving state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.
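The contrast between binary majority-vote rewards and the continuous frequency-based signal can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the exact SFR calibration is defined in the full paper, and the simple frequency-proportional reward below is an assumption.

```python
from collections import Counter

def majority_vote_reward(answers):
    """Binary baseline: reward 1 for answers matching the most frequent
    (pseudo-golden) answer, 0 otherwise."""
    counts = Counter(answers)
    golden, _ = counts.most_common(1)[0]
    return [1.0 if a == golden else 0.0 for a in answers]

def softened_frequency_reward(answers):
    """Continuous sketch of a frequency-calibrated reward: each sampled
    answer is rewarded in proportion to its relative frequency in the
    sampled reasoning set (illustrative, not the paper's exact SFR)."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

answers = ["42", "42", "42", "17", "9"]
print(majority_vote_reward(answers))        # [1.0, 1.0, 1.0, 0.0, 0.0]
print(softened_frequency_reward(answers))   # [0.6, 0.6, 0.6, 0.2, 0.2]
```

The binary scheme gives long-tail answers exactly zero signal, which is the feedback regime the abstract identifies as bias-reinforcing; the softened variant keeps a graded signal for minority reasoning paths.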

Paper Structure

This paper contains 25 sections, 2 theorems, 25 equations, 6 figures, 10 tables, 1 algorithm.

Key Result

Proposition 1

In the GRPO optimization objective, adopting majority voting leads to distribution collapse; the ideal closed-form solution $P(x)$ and the reference distribution $P_{\text{ref}}(x)$ satisfy a closed-form relation involving a normalization constant $Z$. $\blacktriangleleft$
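The proposition's exact relation appears in the full paper. As context, the KL-regularized policy-optimization objective that GRPO builds on is well known to admit the closed form below; the reward $r(x)$ and KL coefficient $\beta$ are symbols from the standard formulation, not taken from this excerpt, so the paper's relation may differ in detail.

```latex
% Closed-form maximizer of the standard KL-regularized objective
%   \max_P \; \mathbb{E}_{x \sim P}[r(x)] - \beta \, \mathrm{KL}(P \,\|\, P_{\text{ref}})
% (illustrative context only; not necessarily the paper's exact relation)
P(x) = \frac{1}{Z}\, P_{\text{ref}}(x)\,
       \exp\!\Big(\tfrac{1}{\beta}\, r(x)\Big),
\qquad
Z = \sum_{x} P_{\text{ref}}(x)\,
    \exp\!\Big(\tfrac{1}{\beta}\, r(x)\Big).
```

Under a majority-vote reward, $r(x)$ is 1 only on the most frequent answer, so this exponential tilting concentrates $P$ on that single mode, which is consistent with the collapse the proposition describes.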

Figures (6)

  • Figure 1: Overview of our work. During unsupervised self-evolutionary reinforcement learning, traditional majority-voting methods (left) for pseudo-labeling rely solely on the model's inherent biases. This frequently leads to model collapse, where the model degenerates into a deterministic mapping and fails to explore the true solutions. To alleviate this phenomenon, our method introduces CSRS (right) to reduce the occurrence of such collapse.
  • Figure 2: Pipeline of our method. (Round 1) illustrates the initial maternal trajectories and answers generated by the MLLM. (Round 2) shows the three core components of CSRS and their improvements to the rewards.
  • Figure 3: Visualization of the accuracy and proportion of high-confidence samples during training. (a) Answer accuracy calculated within partitioned frequency intervals, where each sample is uniquely assigned based on its frequency. (b) The evolution of the proportion of high-confidence (frequency $\in [0.8, 1.0]$) samples during training, which serves as a key indicator of model collapse.
  • Figure 4: Change in the distributions of the Maternal Answers Set and the Re-inference Answers Set, shown at training steps 20, 60, and 100. Blue dots represent the semantic space of vanilla responses, while orange dots denote the semantic space of responses generated by CSRS.
  • Figure 5: Entropy change during training.
  • ...and 1 more figure
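Figures 3(b) and 5 both track collapse via statistics of the sampled answer distribution. A minimal sketch of the entropy diagnostic from Figure 5, assuming Shannon entropy over the empirical answer distribution (the paper's exact estimator is not specified in this excerpt):

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution.
    Entropy collapsing toward 0 signals the deterministic-mapping
    failure mode described as model collapse."""
    n = len(answers)
    probs = [c / n for c in Counter(answers).values()]
    return -sum(p * math.log(p) for p in probs)

print(answer_entropy(["A"] * 8))             # 0.0 — fully collapsed
print(answer_entropy(["A", "B", "C", "D"]))  # log(4), maximal diversity
```

A steadily shrinking entropy across training steps, like a rising share of frequency-$[0.8, 1.0]$ samples, indicates the sampled reasoning set is concentrating on a single answer.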

Theorems & Definitions (4)

  • Proposition 1: Self-evolution Closed-form Solution
  • Proof of Proposition 1
  • Proposition 2: CSRS Can Relieve Model Collapse
  • Proof of Proposition 2