
Robust Multi-Objective Controlled Decoding of Large Language Models

Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

TL;DR

This work tackles test-time multi-objective alignment for large language models by addressing unknown relative importance across objectives. It introduces Robust Multi-Objective Decoding (RMOD), a max-min inference framework that computes worst-case objective weights via a convex optimization and derives a best-response policy analytically, all while regularizing toward the reference model with KL divergence. A practical RMOD variant, including block-wise decoding and weight-approximation techniques, enables low-latency deployment on contemporary LLMs. Empirical results on HH, UltraFeedback, and ValuePrism demonstrate robust, balanced alignment with significant improvements in worst-case rewards compared to baselines, and show favorable latency profiles. The approach offers a principled, inference-time solution to equitable multi-objective control in user- and task-specific contexts, with avenues for refinement in reward design and value-function accuracy.

Abstract

Test-time alignment of Large Language Models (LLMs) to human preferences offers a flexible way to generate responses aligned to diverse objectives without extensive retraining. Existing methods align to multiple objectives simultaneously (e.g., instruction-following, helpfulness, conciseness) by optimizing the corresponding reward functions. However, they often rely on predefined weights or optimize for averages, sacrificing one objective for another and leading to unbalanced outcomes. To address this, we introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that optimizes worst-case rewards. RMOD formalizes the robust decoding problem as a maximin two-player game between reward weights and the sampling policy, solving for the Nash equilibrium. We show that the game reduces to a convex optimization problem for the worst-case weights, while the best-response policy can be computed analytically. We also introduce a practical RMOD variant designed for efficient decoding with contemporary LLMs, incurring minimal computational overhead compared to non-robust Multi-Objective Decoding (MOD) methods. Our experimental results showcase the effectiveness of RMOD in generating responses equitably aligned with diverse objectives, outperforming baselines by up to 20%.
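The two-step structure described in the abstract (a convex problem for the worst-case weights, then an analytic best response) can be illustrated on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a finite set of candidate continuations, the textbook KL-regularized best response `pi ∝ pi_ref * exp(weighted value / temperature)`, and exponentiated-gradient descent as the simplex solver. The names `tilted_policy`, `soft_value`, `worst_case_weights`, and the `temperature` parameter are hypothetical, and how `temperature` maps to the paper's $\lambda$ is not specified here.

```python
import numpy as np

def tilted_policy(ref_probs, values, w, temperature):
    """Best response over K candidates: pi(k) ∝ ref_probs[k] * exp(w·V[:, k] / temperature)."""
    logits = np.log(ref_probs) + (w @ values) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def soft_value(ref_probs, values, w, temperature):
    """Value of the inner maximization: temperature * log E_ref[exp(w·V / temperature)].

    This is a log-sum-exp of functions affine in w, hence convex in w.
    """
    logits = np.log(ref_probs) + (w @ values) / temperature
    m = logits.max()
    return temperature * (m + np.log(np.exp(logits - m).sum()))

def worst_case_weights(ref_probs, values, temperature, steps=2000, lr=0.5):
    """Minimize soft_value over the probability simplex with exponentiated gradient."""
    num_objectives = values.shape[0]
    w = np.full(num_objectives, 1.0 / num_objectives)
    for _ in range(steps):
        pi = tilted_policy(ref_probs, values, w, temperature)
        grad = values @ pi          # d soft_value / d w_g = E_pi[V_g]
        w = w * np.exp(-lr * grad)  # multiplicative update keeps w positive
        w /= w.sum()                # renormalize onto the simplex
    return w

# Toy data (made up): 2 objectives x 4 candidate continuations.
values = np.array([[2.0, 1.0, 0.5, 0.0],   # e.g. a helpfulness-like value
                   [0.0, 0.4, 0.6, 0.5]])  # e.g. a harmlessness-like value
ref_probs = np.full(4, 0.25)
temperature = 0.5

uniform_w = np.array([0.5, 0.5])
w_star = worst_case_weights(ref_probs, values, temperature)

pi_uniform = tilted_policy(ref_probs, values, uniform_w, temperature)
pi_robust = tilted_policy(ref_probs, values, w_star, temperature)

worst_uniform = (values @ pi_uniform).min()  # worst expected value, uniform weights
worst_robust = (values @ pi_robust).min()    # worst expected value, worst-case weights
print("worst-case weights:", w_star)
print("worst expected value (uniform vs robust):", worst_uniform, worst_robust)
```

In this toy instance the second objective is the weaker one under every weighting, so the solver pushes nearly all weight onto it and the resulting policy raises that objective's expected value relative to uniform weights. That improvement is a property of this example, not a general guarantee of the sketch.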

Paper Structure

This paper contains 35 sections, 4 theorems, 47 equations, 9 figures, 2 algorithms.

Key Result

Proposition 3.0

Given the value functions $V_g$ for each objective $g\in \mathcal{G}$, the solution to the inner maximization problem in the robust objective (eq:robust-objective-2) is unique for any given weights $w$ and trade-off parameter $\lambda$, and can be expressed in closed form, with $Z(x,y^t,w)$ as its normalization constant.
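For orientation, KL-regularized maximization problems of this kind have a well-known closed-form maximizer: the reference policy reweighted by the exponentiated objective. The display below is a sketch of that generic form, not a reproduction of the proposition's equation. The temperature $\tau$ and the argument convention $V_g(x, y^t; z)$ are assumptions introduced here, and the relation between $\tau$ and the trade-off parameter $\lambda$ is not fixed by the text above.

```latex
% Generic KL-regularized best response (assumed form; \tau is a hypothetical temperature)
\pi_w(z \mid x, y^t)
  = \frac{\pi_{\mathrm{ref}}(z \mid x, y^t)\,
          \exp\!\Big(\tfrac{1}{\tau}\sum_{g\in\mathcal{G}} w_g\, V_g(x, y^t; z)\Big)}
         {Z(x, y^t, w)},
\qquad
Z(x, y^t, w)
  = \sum_{z'} \pi_{\mathrm{ref}}(z' \mid x, y^t)\,
    \exp\!\Big(\tfrac{1}{\tau}\sum_{g\in\mathcal{G}} w_g\, V_g(x, y^t; z')\Big).
```

Under this form, $Z(x,y^t,w)$ is exactly the factor that makes $\pi_w(\cdot \mid x, y^t)$ sum to one, and $\log Z$ is convex in $w$, which is consistent with the worst-case weights being found by convex optimization.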

Figures (9)

  • Figure 1: (Left) Existing multi-objective alignment methods require the weights for each reward. (Right) RMOD produces a robust response $y$ when a prompt $x$ is given, using the worst-case weights $w^*$ computed by solving a min-max problem. RMOD effectively improves the worst-case reward without requiring externally given weights.
  • Figure 2: Worst-case reward obtained by RMOD and baselines in the HH dataset. We use $B=16$ for all the decoding methods and $\lambda=0.5$ for RMOD. Texts at the top of bars indicate the reward or weighted sum of rewards used for the corresponding method. RS and MOD use the models trained with GRPO. RMOD shows significantly higher worst-case reward than all the baselines, regardless of whether they are fine-tuned or aligned at inference time.
  • Figure 3: (\ref{fig:hh-winrates}-\ref{fig:hh-rewards-block16}) Comparative study on the HH dataset between different decoding methods. We use $\lambda=0.5$ for RMOD. In \ref{fig:hh-winrates}, we present the worst-case win rates against the reference policy across block sizes $B\in\{16, 64, 256\}$. As $B$ decreases from 256 to 16, the worst-case win rate of RMOD increases, consistently outperforming the baselines. \ref{fig:hh-rewards-block16} shows the rewards obtained with $B=16$ for different values of $K$, sharing the same legend as \ref{fig:hh-winrates}. The purple star represents the average reward of $\pi_\mathrm{ref}$, and the dots represent increasing $K$ values (2, 4, 8, 16) as they move away from the purple star. RMOD improves the worst-case reward, achieving a higher harmlessness reward than Uniform. (\ref{fig:hh-rewards-lambda}) Testing different values of $\lambda$ for RMOD on the HH dataset. We ablate the performance of RMOD against the value of $\lambda$ with $B=16$ and demonstrate that smaller values of $\lambda$ reduce RMOD to Uniform decoding. On the other hand, as $\lambda$ increases, RMOD concentrates on improving the worst-case reward.
  • Figure 4: Performance comparison of decoding algorithms on UltraFeedback (\ref{fig:uf-winrates}-\ref{fig:uf-radar}). \ref{fig:uf-winrates} displays worst-case win rates on the UltraFeedback dataset for block sizes $B\in\{4, 8, 16, 32, 128\}$ and $K = 16$. RMOD achieves a win rate above 57% against the reference policy and consistently outperforms Uniform decoding. \ref{fig:uf-radar} displays the average reward on the UltraFeedback dataset with $B=4$. The purple star denotes the worst-case reward of RMOD, which corresponds to the conciseness objective and is much higher than that of Uniform decoding and Conciseness (orange, green dots).
  • Figure 5: Analysis of RMOD's weight and value predictions on the UltraFeedback dataset while generating a response for a single prompt with $K=16$ candidates and block size $B=16$. RMOD adapts its weights for each block and follows the dynamic changes in the worst-case value, mainly between conciseness and honesty in this case. RMOD's generated response significantly outperforms the response generated by Uniform decoding in terms of worst-case reward, highlighting the robustness of our method.
  • ...and 4 more figures
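
The captions above report worst-case win rates against the reference policy without stating how the metric is computed. The sketch below shows one plausible reading, stated as an assumption: for each prompt, take the minimum reward across objectives for both the method's response and the reference response, and count a win when the method's minimum is strictly higher. The function name `worst_case_win_rate` and the toy reward arrays are hypothetical.

```python
import numpy as np

def worst_case_win_rate(method_rewards, reference_rewards):
    """Fraction of prompts where the method's worst objective beats the reference's.

    Both inputs have shape (num_prompts, num_objectives).
    Assumed definition; the source does not spell out the metric.
    """
    method_worst = np.min(method_rewards, axis=1)        # worst objective per prompt
    reference_worst = np.min(reference_rewards, axis=1)
    return float(np.mean(method_worst > reference_worst))

# Made-up rewards for 3 prompts and 2 objectives.
method = np.array([[0.6, 0.7],
                   [0.2, 0.9],
                   [0.5, 0.4]])
reference = np.array([[0.5, 0.9],
                      [0.3, 0.3],
                      [0.1, 0.8]])

rate = worst_case_win_rate(method, reference)
print(f"worst-case win rate: {rate:.3f}")
```

An alternative reading would compute a separate win rate per objective and report the minimum across objectives; the two can differ, so the definition should be checked against the paper before comparing numbers.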

Theorems & Definitions (6)

  • Proposition 3.0
  • Proposition 3.0
  • Proposition A.0
  • proof
  • Proposition A.0
  • proof