Superficial Self-Improved Reasoners Benefit from Model Merging
Xiangchi Yuan, Chunhui Zhang, Zheyuan Liu, Dachuan Shi, Leyan Pan, Soroush Vosoughi, Wenke Lee
TL;DR
The paper identifies a risk in LLM self-improvement where gains on in-domain reasoning come at the cost of out-of-domain generalization due to memorization. It analyzes layer-wise contributions and finds a mismatch: reasoning-critical layers receive small updates while less important layers change more, driving superficial improvements. The authors propose Iterative Model Merging (IMM), optionally augmented with DARE masking, to fuse base and self-improved models across iterations, preserving generalization while enabling reasoning gains. Empirical results across multiple datasets and model scales show that IMM mitigates model collapse, maintains or improves OOD performance, and extends to distillation scenarios, highlighting its practical potential for robust self-improving systems.
Abstract
As scaled language models (LMs) approach human-level reasoning capabilities, self-improvement emerges as a solution for synthesizing high-quality data corpora. While previous research has identified model collapse as a risk in self-improvement, where model outputs become increasingly deterministic, we discover a more fundamental challenge: the superficial self-improved reasoners phenomenon. In particular, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they actually compromise their generalized reasoning capabilities on out-of-domain (OOD) tasks due to memorization rather than genuine reasoning. Through a systematic investigation of LM architecture, we discover that during self-improvement, LM weight updates are concentrated in less reasoning-critical layers, leading to superficial learning. To address this, we propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization while incorporating genuine reasoning improvements. Our approach effectively mitigates both LM collapse and superficial learning, moving towards more stable self-improving systems.
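To make the merging idea concrete, the following is a minimal sketch of one way weights from an original and a self-improved model could be combined over several iterations. It is an illustration under stated assumptions, not the authors' implementation: the function names (`dare_mask`, `merge_weights`, `iterative_merge`), the interpolation weight `alpha`, the drop probability `drop_prob`, the representation of a model as a dictionary of float lists, and the choice to merge against the original base at every iteration are all hypothetical. The DARE step follows the usual drop-and-rescale recipe: each entry of the weight delta is dropped at random and the survivors are rescaled.

```python
import random


def dare_mask(delta, drop_prob, rng):
    """Drop each delta entry with probability drop_prob and rescale the
    survivors by 1 / (1 - drop_prob), as in DARE drop-and-rescale."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else d * scale for d in delta]


def merge_weights(base, tuned, alpha=0.5, drop_prob=0.0, seed=0):
    """Merge two models stored as {parameter_name: list of floats}.

    Assumed rule (hypothetical): merged = base + alpha * delta, where
    delta = tuned - base, optionally DARE-masked first.
    """
    rng = random.Random(seed)
    merged = {}
    for name, base_vals in base.items():
        delta = [t - b for t, b in zip(tuned[name], base_vals)]
        if drop_prob > 0.0:
            delta = dare_mask(delta, drop_prob, rng)
        merged[name] = [b + alpha * d for b, d in zip(base_vals, delta)]
    return merged


def iterative_merge(base, self_improve_step, num_iterations,
                    alpha=0.5, drop_prob=0.0):
    """Alternate a self-improvement step with a merge back into the base.

    self_improve_step is a caller-supplied placeholder that maps the
    current weights to fine-tuned weights (e.g. training on the model's
    own filtered outputs). Merging against the original base each round
    is an assumption of this sketch.
    """
    current = base
    for iteration in range(num_iterations):
        tuned = self_improve_step(current)
        current = merge_weights(base, tuned, alpha, drop_prob, seed=iteration)
    return current
```

In this sketch, `alpha` controls how much of each round's update is retained: a value of 0 keeps the base model unchanged, while a value of 1 with `drop_prob=0.0` reduces to ordinary self-improvement without merging.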
