Learning Dynamics of VLM Finetuning
Jusheng Zhang, Kaitong Cai, Jing Yang, Keze Wang
TL;DR
The paper addresses instability in preference-based fine-tuning of vision-language models (VLMs) caused by uninformative negatives. It introduces CW-DPO, a two-stage, learning-dynamics-aware framework: Stage 1 applies Constrained SFT (SFT-C) to smooth the loss landscape, and Stage 2 applies a competence-aware Direct Preference Optimization (DPO) objective whose cooling weight down-weights easy negatives and emphasizes hard ones. A formal learning-dynamics lens decomposes per-step influence into Belief Geometry, the empirical neural tangent kernel (eNTK), and the Loss Residual, identifying the gradient on the losing (rejected) response as the key source of instability and guiding the design of the cooling mechanism. Empirical results on image captioning and other multimodal tasks show that CW-DPO achieves more stable optimization, better calibration, higher win rates, and faster convergence than SFT-only or vanilla DPO, with ablations confirming the pivotal role of the cooling weight. The approach offers a practical, generalizable principle for robust VLM alignment and may extend to broader multimodal fine-tuning scenarios.
Abstract
Preference-based finetuning of vision--language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as \textbf{learning-dynamics--aware optimization} and introduce \textbf{Cooling-Weighted DPO (CW-DPO)}, a two-stage recipe that explicitly models and exploits the training trajectory. \textbf{Stage 1} performs supervised finetuning with \textbf{gentle negatives}: \textbf{low-weight smoothed supervision} that regularizes the base policy and curbs overconfidence without explicit penalties. \textbf{Stage 2} applies a DPO objective in which the \textbf{negative term is scaled by a cooling weight} computed from the model's \textbf{average token log-probability} on each negative, suppressing uninformative gradients from easy or off-distribution samples while preserving signal from hard negatives. In practice, we emphasize \textbf{on-policy negatives} and allow \textbf{mixed negatives} by blending a controllable fraction of dataset negatives to maintain contrast freshness. Throughout, we instrument training with $\Delta\!\log p$ probes on positives and negatives as first-class signals for early stopping, curriculum design, and failure diagnosis. Across diverse VLM tasks, CW-DPO yields \textbf{more stable optimization}, \textbf{better calibration}, and \textbf{higher pairwise win-rates} than SFT-only and vanilla DPO, while \textbf{converging in fewer steps}. Ablations isolate the \textbf{cooling-weight mechanism} as the primary driver of these gains and show complementary benefits from mixing on-policy and dataset negatives. Taken together, our results show that \textbf{smoothing learning dynamics before cooling preferences} is a simple, general principle for robust VLM alignment.
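
The Stage 2 idea can be illustrated with a minimal, framework-free sketch. The abstract does not give the functional form of the cooling weight, so the logistic form below, the names `cooling_weight` and `cw_dpo_loss`, and the parameters `tau` and `temperature` are all illustrative assumptions rather than the authors' definitions. The sketch only shows the qualitative behavior described above: a negative that the model already assigns a low average token log-probability (an easy negative) gets a weight near 0, while a negative the model finds plausible (a hard negative) keeps a weight near 1.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x):
    # Stable log(1 + exp(x)); note that -log(sigmoid(m)) == softplus(-m).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cooling_weight(neg_token_logps, tau=-2.0, temperature=0.5):
    """Hypothetical cooling weight in (0, 1).

    ASSUMPTION: a logistic function of the average token log-probability
    of the negative. `tau` (threshold) and `temperature` (sharpness) are
    illustrative knobs, not values taken from the paper.
    """
    avg_logp = sum(neg_token_logps) / len(neg_token_logps)
    return sigmoid((avg_logp - tau) / temperature)


def cw_dpo_loss(pos_logp, pos_ref_logp, neg_logp, neg_ref_logp,
                neg_token_logps, beta=0.1, tau=-2.0, temperature=0.5):
    """DPO-style loss for one pair, with the negative term scaled by w.

    pos_*/neg_* are sequence log-probabilities under the policy and the
    frozen reference model. Returns (loss, w).
    """
    w = cooling_weight(neg_token_logps, tau, temperature)
    pos_term = pos_logp - pos_ref_logp
    neg_term = neg_logp - neg_ref_logp
    # Standard DPO uses pos_term - neg_term; here the negative is cooled.
    margin = beta * (pos_term - w * neg_term)
    return softplus(-margin), w


if __name__ == "__main__":
    easy = [-6.0, -7.0, -5.0]   # model already finds this negative unlikely
    hard = [-1.0, -0.5, -1.5]   # model finds this negative plausible
    print("easy-negative weight:", cooling_weight(easy))
    print("hard-negative weight:", cooling_weight(hard))
```

In an autodiff framework, one would presumably treat the weight as a constant (stop-gradient) so that it only rescales the gradient flowing through the negative term; that choice is likewise an assumption of this sketch.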
