Learning Dynamics of VLM Finetuning

Jusheng Zhang, Kaitong Cai, Jing Yang, Keze Wang

TL;DR

The paper addresses instability in preference-based fine-tuning of vision-language models caused by uninformative negatives. It introduces CW-DPO, a two-stage, learning-dynamics-aware framework: Stage 1 Constrained SFT (SFT-C) to smooth the loss landscape, and Stage 2 competence-aware Direct Preference Optimization (DPO) with a cooling weight to down-weight easy negatives and emphasize hard negatives. A formal learning-dynamics lens decomposes per-step influence into Belief Geometry, eNTK, and Loss Residual, identifying the loser gradient as the key instability source and guiding the design of the cooling mechanism. Empirical results across image captioning and multimodal tasks show CW-DPO achieves more stable optimization, better calibration, higher win rates, and faster convergence than SFT or vanilla DPO, with ablations confirming the pivotal role of the cooling weight. The approach offers a practical, generalizable principle for robust VLM alignment and can extend to broader multimodal fine-tuning scenarios.

Abstract

Preference-based finetuning of vision--language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as \textbf{learning-dynamics--aware optimization} and introduce \textbf{Cooling-Weighted DPO (CW-DPO)}, a two-stage recipe that explicitly models and exploits the training trajectory. \textbf{Stage 1} performs supervised finetuning with \textbf{gentle negatives}: \textbf{low-weight smoothed supervision} that regularizes the base policy and curbs overconfidence without explicit penalties. \textbf{Stage 2} applies a DPO objective in which the \textbf{negative term is scaled by a cooling weight} computed from the model's \textbf{average token log-probability} on each negative, suppressing uninformative gradients from easy or off-distribution samples while preserving signal from hard negatives. In practice, we emphasize \textbf{on-policy negatives} and allow \textbf{mixed negatives} by blending a controllable fraction of dataset negatives to maintain contrast freshness. Throughout, we instrument training with $\Delta\!\log p$ probes on positives and negatives as first-class signals for early stopping, curriculum design, and failure diagnosis. Across diverse VLM tasks, CW-DPO yields \textbf{more stable optimization}, \textbf{better calibration}, and \textbf{higher pairwise win-rates} than SFT-only and vanilla DPO, while \textbf{converging in fewer steps}. Ablations isolate the \textbf{cooling-weight mechanism} as the primary driver of these gains and show complementary benefits from mixing on-policy and dataset negatives. Taken together, our results show that \textbf{smoothing learning dynamics before cooling preferences} is a simple, general principle for robust VLM alignment.
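
The Stage 2 idea in the abstract (scale the negative term of a DPO objective by a weight derived from the model's average token log-probability on the negative) can be illustrated with a small sketch. The abstract does not give the functional form of the cooling weight, so the sigmoid gate below, its threshold `tau`, its `temperature`, and the function names are all assumptions made for illustration, not the paper's definition.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cooling_weight(neg_token_logps, tau=-2.0, temperature=0.5):
    """Hypothetical cooling weight in (0, 1).

    Assumption: negatives the policy already finds unlikely (low average
    token log-prob, i.e. "easy" or off-distribution) get a weight near 0,
    while plausible ("hard") negatives get a weight near 1.
    """
    avg_logp = sum(neg_token_logps) / len(neg_token_logps)
    return sigmoid((avg_logp - tau) / temperature)

def cw_dpo_loss(pol_pos, ref_pos, pol_neg, ref_neg, beta=0.1):
    """DPO-style loss with the negative term scaled by the cooling weight.

    Each argument is a list of per-token log-probs of a response under the
    policy (pol_*) or the frozen reference model (ref_*).
    In an autodiff framework one would probably treat the weight as a
    constant (stop-gradient); that is also an assumption of this sketch.
    """
    w = cooling_weight(pol_neg)
    pos_term = sum(pol_pos) - sum(ref_pos)
    neg_term = sum(pol_neg) - sum(ref_neg)
    margin = beta * (pos_term - w * neg_term)
    # -log(sigmoid(margin)) written as softplus(-margin)
    loss = math.log1p(math.exp(-margin))
    return loss, w

# An easy negative (very unlikely under the policy) vs. a hard one.
easy_neg = [-6.0, -5.5, -6.5]
hard_neg = [-0.4, -0.6, -0.5]
w_easy = cooling_weight(easy_neg)
w_hard = cooling_weight(hard_neg)
```

With these toy numbers the easy negative contributes almost nothing to the margin, while the hard negative keeps nearly its full contrastive signal, which is the qualitative behavior the abstract describes.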

Paper Structure

This paper contains 45 sections, 1 theorem, 29 equations, 6 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

The log-likelihood change on $\chi_o$ after an update on $\chi_u$ (learning rate $\eta$) approximately decomposes into three key elements:

  • Belief Geometry ($A_t$) encodes predictive sensitivity to logit perturbations, capturing belief-landscape curvature.
  • eNTK Kernel ($K_t = J_o J_u^\top$, with Jacobian $J = \nabla_\theta z(\theta_t; \chi)$) propagates updates parametrically.
  • Loss Residual ($G_t$) directs logit adjustments v...
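
The three-term decomposition in Proposition 1 can be checked numerically on a toy model. The sketch below assumes the common first-order form $\Delta \log \pi(\cdot \mid \chi_o) \approx -\eta\, A_t K_t G_t$ and uses a linear-softmax classifier, for which $K_t$ reduces to $(x_o \cdot x_u) I$. The model, the variable names, and the sign convention are illustrative assumptions; the paper's sequence-aware statement is not reproduced in this summary.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    # Linear model: z = W x, with W stored as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def one_step_influence(W, x_o, x_u, y_u, eta):
    """Compare the first-order prediction -eta * A K G with the actual
    change in log pi(. | x_o) after one SGD step on (x_u, y_u)."""
    V = len(W)
    pi_o = softmax(logits(W, x_o))
    pi_u = softmax(logits(W, x_u))
    # A: d log pi(. | x_o) / d z_o, i.e. A[i][j] = delta_ij - pi_o[j]
    A = [[(1.0 if i == j else 0.0) - pi_o[j] for j in range(V)]
         for i in range(V)]
    # K: eNTK of a linear model is (x_o . x_u) times the identity
    k = sum(a * b for a, b in zip(x_o, x_u))
    # G: cross-entropy residual on the update sample, pi_u - one_hot(y_u)
    G = [pi_u[j] - (1.0 if j == y_u else 0.0) for j in range(V)]
    predicted = [-eta * k * sum(A[i][j] * G[j] for j in range(V))
                 for i in range(V)]
    # Actual SGD step: W <- W - eta * G x_u^T
    W_new = [[W[i][c] - eta * G[i] * x_u[c] for c in range(len(x_u))]
             for i in range(V)]
    pi_o_new = softmax(logits(W_new, x_o))
    actual = [math.log(pi_o_new[i]) - math.log(pi_o[i]) for i in range(V)]
    return predicted, actual

W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
predicted, actual = one_step_influence(
    W, x_o=[1.0, 0.5], x_u=[0.8, 0.6], y_u=0, eta=0.01)
```

For a small learning rate the two vectors agree up to second-order terms. Because $x_o$ and $x_u$ have a positive inner product here, raising the log-probability of class 0 on the update sample also raises it on the observation sample.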

Figures (6)

  • Figure 1: Two-stage optimization process of CW-DPO. Stage 1 ($y^+$ Training) leverages positive supervision for stability but yields overly uniform language styles (e.g., "A … on the …"). Stage 2 ($y^-$ Training) introduces negative contrast for variation but risks errors (e.g., a running kitten as "flying"). CW-DPO's cooling-weighted mechanism dynamically attenuates uninformative negatives while amplifying hard ones, mitigating error propagation, and enhancing stylistic diversity.
  • Figure 2: CW-DPO is designed to balance generalization and precision through a two-stage optimization strategy. In Stage 1, Smooth SFT leverages positive samples together with negative samples containing minor errors to construct a smoothed supervision signal. This broadens the model's output probability distribution, thereby enhancing its generalization ability and robustness. In Stage 2, CW-DPO employs preference pairs with fine-grained errors for DPO. By sharpening the probability distribution, this stage strengthens the model's capacity for precise discrimination of critical details.
  • Figure 3: Validation of Stage 1 Constrained SFT (SFT-C) vs. standard SFT on: (1) loss; (2) entropy; (3) CIDEr; and (4) SPICE for Top-5 generations. SFT-C sustains higher entropy (less squeezing) and quality.
  • Figure 4: CW-DPO alleviates the squeezing effect of vanilla DPO, yielding smaller distribution shifts (left), smoother posteriors (middle), and improved generation quality with better calibration (right).
  • Figure 5: Theoretical Decomposition of $LB_{K_{uo}}$ Across CW-DPO Training. Each column represents a different fixed update sample ($y_1, y_2, y_3$). Each row visualizes the trajectory of a specific component from Proposition 1 for four observation samples ($x_o$). The results show that the growth of the final proxy metric ($LB_{K_{uo}}$, bottom row) is primarily driven by the systematic increase in the Belief Geometry term ($||A_o||_F^2$, second row) and the decay of the Loss Residual ($||G_o||_F^2$, third row), while the per-step Update Influence ($||\Delta \log \pi||_F^2$, top row) remains stationary.
  • ...and 1 more figure
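
The Stage 1 description in Figure 2 (positive samples combined with near-miss negatives into a smoothed supervision signal) can be sketched as a convex mixture of two likelihood terms. This is only one plausible reading: the mixing rule, the `neg_weight` parameter, and the function names are hypothetical and are not taken from the paper.

```python
def mean_nll(token_logps):
    """Average negative log-likelihood over the tokens of one response."""
    return -sum(token_logps) / len(token_logps)

def smoothed_sft_loss(pos_token_logps, gentle_neg_token_logps, neg_weight=0.1):
    """Hypothetical Stage 1 loss: mostly fit the positive response, with a
    small weight on a gentle negative (a response with minor errors).

    Assumption: keeping some probability mass on near-miss responses acts
    like sequence-level label smoothing, which discourages overconfidence.
    """
    if not 0.0 <= neg_weight < 0.5:
        raise ValueError("neg_weight should stay small (0 <= w < 0.5)")
    pos_nll = mean_nll(pos_token_logps)
    neg_nll = mean_nll(gentle_neg_token_logps)
    return (1.0 - neg_weight) * pos_nll + neg_weight * neg_nll

# Per-token log-probs under the current policy (toy values).
loss = smoothed_sft_loss([-1.0, -1.0], [-3.0, -3.0], neg_weight=0.1)
```

Setting `neg_weight=0.0` recovers plain SFT on the positive response, so the weight controls how far the supervision is smoothed.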

Theorems & Definitions (2)

  • Remark 1: Insufficiency of DPO's Implicit Regularization
  • Proposition 1: Sequence-Aware One-Step Influence