Uncovering Cross-Objective Interference in Multi-Objective Alignment

Yining Lu, Meng Jiang

TL;DR

The paper identifies cross-objective interference as a key failure mode in multi-objective LLM alignment and develops a unified framework combining local improvement and global convergence analyses. A local covariance law shows that first-order improvements occur when objective rewards covary positively with the scalarized score, and this insight extends to clipped surrogates used in modern RL fine-tuning. The authors propose Covariance Targeted Weight Adaptation (CTWA), a plug-in controller that preserves positive covariance across objectives, and demonstrate its effectiveness against baselines across multiple models and RL algorithms. Additionally, a global convergence view via the $\mu$-PL condition explains model-dependent differences and provides conditions under which scalarized optimization converges globally. Together, these results advance robust, balanced multi-objective alignment for LLMs with practical, low-overhead implementation advantages.

Abstract

We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while causing others to degrade. We formalize this phenomenon as cross-objective interference and conduct the first systematic study across classic scalarization algorithms, showing that interference is pervasive and exhibits strong model dependence. To explain this phenomenon, we derive a local covariance law showing that an objective improves at first order when its reward exhibits positive covariance with the scalarized score. We extend this analysis to clipped surrogate objectives used in modern alignment, demonstrating that the covariance law remains valid under mild conditions despite clipping. Building on this analysis, we propose Covariance Targeted Weight Adaptation (CTWA), a plug-and-play method that maintains positive covariance between objective rewards and the training signal to effectively mitigate cross-objective interference. Finally, we complement these local improvement conditions with a global convergence analysis under the Polyak--Łojasiewicz condition, establishing when non-convex scalarized optimization achieves global convergence and how cross-objective interference depends on specific model geometric properties.
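The local covariance law can be checked numerically in a small categorical bandit. The sketch below is an illustration, not the paper's implementation. It assumes the update is an exponential tilt of the policy by the scalarized score, `pi_eta(a) ∝ pi(a) * exp(eta * s(a))`, which is the familiar form of a KL-regularized improvement step; the excerpt here does not show the paper's exact update. The helper names (`tilt`, `covariance`, `first_order_gain`), the random rewards, and the weights are all hypothetical.

```python
import numpy as np

def tilt(pi, score, eta):
    """Exponentially tilt policy `pi` by eta * score and renormalize (assumed update form)."""
    logits = np.log(pi) + eta * score
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def covariance(pi, x, y):
    """Covariance of x and y under the categorical distribution `pi`."""
    return float(np.sum(pi * x * y) - np.sum(pi * x) * np.sum(pi * y))

def first_order_gain(pi, reward, score, eta=1e-5):
    """Central finite-difference estimate of d/d(eta) E_{pi_eta}[reward] at eta = 0."""
    up = np.sum(tilt(pi, score, eta) * reward)
    down = np.sum(tilt(pi, score, -eta) * reward)
    return float((up - down) / (2 * eta))

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(6))        # hypothetical policy over 6 actions
rewards = rng.normal(size=(3, 6))     # hypothetical rewards: 3 objectives x 6 actions
weights = np.array([0.5, 0.3, 0.2])   # hypothetical scalarization weights
score = weights @ rewards             # scalarized score s(a)

# First-order change of each objective vs. its covariance with the scalarized score.
gains = [first_order_gain(pi, r, score) for r in rewards]
covs = [covariance(pi, r, score) for r in rewards]

for m, (g, c) in enumerate(zip(gains, covs)):
    print(f"objective {m}: first-order gain {g:+.6f}, Cov(r_m, s) {c:+.6f}")
```

Under this assumed update, each objective's first-order gain matches its covariance with the scalarized score. The weighted sum of those covariances equals the variance of the scalarized score, which is never negative, so the scalarized objective can improve while an individual objective with negative covariance degrades.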

Paper Structure (39 sections, 9 theorems, 113 equations, 5 figures, 1 table, 5 algorithms)

Key Result

Lemma 4.1

The optimizer of eq:kl-policy-improvement is

Figures (5)

  • Figure 1: Multi-objective alignment under different scalarization algorithms. We report moving-averaged test performance along training for three objectives: accuracy, conciseness, and clarity (left to right). We aim to train models with strong problem-solving ability (higher accuracy), computational efficiency (fewer response tokens), and clear reasoning processes (higher clarity). Results are shown for three models trained on the Math500 dataset with different scalarization algorithms adapted from MTL and MOO. Our method, CTWA, effectively mitigates cross-objective interference compared to others. Competing methods either quickly sacrifice accuracy to achieve superficially high conciseness and clarity (e.g., GradNorm in the Qwen2.5 base panel, Linear and Dynamic weighting in the Qwen2.5 IFT panel), or try to maintain high accuracy while overlooking the improvement of the others (e.g., Lagrangian in the Qwen2.5 base panel and PAMA in the Qwen2.5 IFT panel). In contrast, CTWA achieves strong, balanced performance across all three objectives. For instance, in the Qwen3 base panel, CTWA maintains the highest accuracy without any degradation while achieving competitive conciseness and clarity. Even when CTWA's accuracy is slightly lower than Lagrangian's (e.g., at training step 500 in the Qwen2.5 base and Qwen2.5 IFT panels), it still surpasses all other methods and excels on both conciseness and clarity.
  • Figure 2: Scalarization weights in log space ($u_m$) during training of Qwen3-1.7B-Base.
  • Figure 3: Covariance $c_m$ between reward and clipped advantage weight for each objective during training of Qwen3-1.7B-Base.
  • Figure 4: Gradient alignment across objectives during multi-objective alignment. We measure pairwise cosine similarity between per-objective gradients throughout training. Negative values indicate conflicting updates, which is a standard proxy for identifying conflicting objectives in MTL [NEURIPS2020_3fe78a8a]. Across all three models, cosine similarities remain mostly non-negative and converge toward 0 as training progresses, suggesting that objectives are weakly coupled, neither strongly synergistic nor persistently antagonistic, with no conflicting behavior observed.
  • Figure 5: Multi-objective alignment using GRPO with different scalarization algorithms. We report moving-averaged test performance along training for three objectives: accuracy, conciseness, and clarity (left to right). Similar to the observations from Figure 1, CTWA achieves the most balanced performance across objectives without excessively sacrificing one for another. While Lagrangian, PAMA, and Tchebycheff maintain higher accuracy in the GRPO Qwen2.5 base and Qwen2.5 IFT panels, each of them has significant drawbacks. Lagrangian exhibits remarkably worse conciseness and clarity, PAMA fails to improve these two objectives at all, and Tchebycheff collapses entirely under REINFORCE, demonstrating poor generalization across RL algorithms. Instead, CTWA effectively mitigates cross-objective interference, achieving competitive performance on all objectives across different models and RL algorithms.
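
The pairwise measurement described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, assuming each objective's gradient has already been flattened into one vector; the function name `pairwise_cosine` and the toy gradients are hypothetical and are not taken from the paper.

```python
import numpy as np

def pairwise_cosine(grads, eps=1e-12):
    """Return the matrix of cosine similarities between per-objective gradient vectors.

    `grads` has shape (num_objectives, num_parameters). Entry (i, j) is negative
    when objectives i and j push the parameters in conflicting directions.
    """
    g = np.asarray(grads, dtype=float)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    unit = g / np.maximum(norms, eps)  # guard against zero-norm gradients
    return unit @ unit.T

# Toy flattened gradients for three objectives (made-up values for illustration).
grads = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
sim = pairwise_cosine(grads)
print(np.round(sim, 3))
```

In this toy input the first and third gradients point in opposite directions, so their similarity is negative, while the first and second are partially aligned. A value near 0 corresponds to the weakly coupled regime the caption describes.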

Theorems & Definitions (25)

  • Lemma 4.1
  • Theorem 4.2: First-order local covariance law
  • Remark 4.3
  • Lemma 4.4
  • Theorem 4.5: Fisher-covariance sufficient condition for natural gradient updates
  • Corollary 4.6: Categorical bandit case
  • Corollary 4.7: Clipping robustness
  • Remark 4.8
  • Definition 6.1
  • Theorem 6.5
  • ...and 15 more