Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

Yuming Yan, Shuo Yang, Kai Tang, Sihong Chen, Yang Zhang, Ke Xu, Dan Hu, Qun Yu, Pengfei Hu, Edith C. H. Ngai

TL;DR

RCPA adapts vision-language models to specialized domains without eroding their general multimodal capabilities. It trains in two phases, Pre-Alignment and Reinforcement Alignment, and adds two curriculum modules, Curriculum Progress Perception (CPP) and Curriculum Difficulty Perception (CDP), on top of GRPON, a GRPO variant for non-deep-thinking models, with a curated reward signal. The approach balances constrained imitation against reward-driven optimization, which mitigates optimization collapse and forgetting. On COCO, Geo170K, and OpenI, RCPA reaches domain performance competitive with FFT while maintaining strong generalization, and it outperforms SFT and standard RL baselines on both domain-specific and general measures.

Abstract

Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model's domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.
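
The optimization collapse mentioned above can be illustrated with the group-relative advantage used in standard GRPO, where each sampled response is scored against the mean and standard deviation of its own group. The sketch below is a minimal illustration, not the authors' implementation: the function name, the `eps` stabilizer, and the toy reward values are assumptions. It shows one plausible reading of the failure mode: if the model lacks domain knowledge and every response in a group earns the same reward, all advantages are zero and the policy gradient carries no signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses (GRPO-style).

    `eps` is an assumed stabilizer that avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group where some responses succeed: advantages separate good from bad.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 1.0])

# Toy group where every response fails: every advantage is zero,
# so this group contributes no learning signal.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(collapsed)
```

Under this reading, the early constrained phase of RCPA serves to get the model to a point where sampled groups contain reward variation at all, after which full reward-driven optimization becomes useful.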

Paper Structure (22 sections, 6 equations, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: The Overview of RCPA. RCPA is a post-training framework that integrates domain knowledge acquisition with preference alignment in a curriculum-driven manner, built upon the GRPON (GRPO for Non-Deep-Thinking Models) framework for VQA-style tasks. It consists of two phases: Pre-Alignment, which introduces domain concepts with controlled constraints to bootstrap initial competence, and Reinforcement Alignment, which refines the model’s responses using full reward-driven optimization. Key components include Curriculum Progress Perception (CPP), which adjusts reward thresholds to match the model's evolving competence, and Curriculum Difficulty Perception (CDP), which prioritizes difficult samples to enhance training efficiency and prevent overfitting.
  • Figure 2: Results of Ablation and Parameter studies. (a) Ablation on COCO Captions demonstrates the contribution of CPP and CDP, with CPP leading to larger gains in domain-specific learning, while CDP enhances specialization by reweighting samples based on difficulty. (b) The impact of the sliding token ratio ($\sigma$) on domain knowledge learning and training efficiency reveals the optimal value of 16. (c) and (d) Parameter study results show that optimal performance is achieved with $\alpha=0.6, \beta=0.7, \delta_{\text{min}} = 0.7$, and $\delta_{\text{max}}=0.8$, based on aggregated evaluation metrics.
  • Figure 3: (a) The pre-alignment stage accounts for 28% of the total training time. (b) At the same batch size, RCPA increases computation time per step by 56% compared with GRPO. (c) Compared with GRPO, RCPA reduces policy update variance by about 41% (measured via KL divergence between consecutive policies). (d) Compared with GRPO, RCPA shows a more sustained and stable increase in reward values throughout training.
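
The stability measurement in Figure 3(c) relies on the KL divergence between consecutive policies. The following sketch shows how such a quantity could be computed for toy policy snapshots represented as discrete distributions over the same vocabulary. The source does not state the KL direction or how per-step values are aggregated, so the choice of KL(previous || current), the use of population variance, the helper names, and the toy snapshots are all assumptions.

```python
import math
from statistics import pvariance


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def consecutive_kls(snapshots):
    """KL between each policy snapshot and the one that follows it."""
    return [kl_divergence(prev, cur) for prev, cur in zip(snapshots, snapshots[1:])]


# Hypothetical next-token distributions at three training steps.
snapshots = [
    [0.5, 0.5],
    [0.6, 0.4],
    [0.6, 0.4],  # no change from the previous step
]

step_kls = consecutive_kls(snapshots)
update_variance = pvariance(step_kls)  # assumed aggregation, for illustration

print(step_kls)
print(update_variance)
```

In practice the distributions would come from the model's token probabilities on a fixed set of prompts, averaged over positions. The toy lists here only stand in for that.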