MAVIS: Multi-Objective Alignment via Inference-Time Value-Guided Selection

Jeremy Carleton, Debajoy Mukherjee, Srinivas Shakkottai, Dileep Kalathil

TL;DR

MAVIS tackles the challenge of aligning LLM outputs to multiple, potentially conflicting objectives without fine-tuning the base model. It trains small per-objective value models and uses them to tilt the reference policy at inference, forming a weighted, KL-regularized objective with a token-level Q-function basis. The approach guarantees monotone improvement in the KL-regularized value and extends to multi-objective settings by linearly combining per-objective Q-values, enabling dynamic trade-offs at inference time. Empirically, MAVIS achieves larger Pareto fronts than several fine-tuning baselines and competitive or superior performance compared with other inference-time methods, across multiple datasets and model sizes, with an emphasis on practical efficiency and flexibility.

Abstract

Large Language Models (LLMs) are increasingly deployed across diverse applications that demand balancing multiple, often conflicting, objectives, such as helpfulness, harmlessness, and humor. Many traditional methods for aligning outputs to user-specific preferences require fine-tuning models for each objective or for specific preference configurations, which is computationally expensive and inflexible. We introduce MAVIS (Multi-Objective Alignment via Inference-Time Value-Guided Selection), a lightweight inference-time alignment framework that enables dynamic control over LLM behavior without modifying the base model's weights. MAVIS trains a set of small value models, each corresponding to a distinct objective. At inference time, these value models are combined using user-specified weights to produce a tilting function that adjusts the base model's output distribution toward desired trade-offs. The value models are trained using a simple iterative algorithm that enables monotonic improvement of the KL-regularized policy. We show empirically that MAVIS achieves a superior Pareto front compared to baselines that either fine-tune per-objective models and combine them post hoc or train a single preference-conditioned value model for guidance. Our code is available at https://github.com/5-Jeremy/MAVIS/tree/main.

Paper Structure

This paper contains 38 sections, 4 theorems, 28 equations, 7 figures, 16 tables, 3 algorithms.

Key Result

Theorem 1

Consider a general infinite-horizon discounted MDP. Define the regularized value of a policy $\pi$ as follows: Consider the following update rule applied over all state-action pairs, which starts with $\pi^0 = \pi^{\textnormal{ref}}$: Under standard conditions on $\pi^{\textnormal{ref}}$ and the MDP, repeated application of this update rule ensures monotonic improvement in the Q-value for $\pi^{
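The theorem statement above is truncated and its displayed equations are missing from this excerpt. For orientation only, the following is one standard way to write a KL-regularized value and the corresponding tilting update. It is a sketch under assumptions, not a reproduction of the paper's statement: the symbols $r$ (reward), $\gamma$ (discount factor), and $\beta$ (regularization coefficient) are introduced here for illustration and may differ from the paper's notation.

```latex
% Assumed notation: r = reward, \gamma = discount, \beta > 0 = KL coefficient.
\begin{align}
V^{\pi}(s) &= \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi}(s,a) \right]
  - \beta \, \mathrm{KL}\!\left( \pi(\cdot \mid s) \,\|\, \pi^{\textnormal{ref}}(\cdot \mid s) \right), \\
Q^{\pi}(s,a) &= r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ V^{\pi}(s') \right], \\
\pi^{k+1}(a \mid s) &\propto \pi^{\textnormal{ref}}(a \mid s) \,
  \exp\!\left( Q^{\pi^{k}}(s,a) / \beta \right), \qquad \pi^{0} = \pi^{\textnormal{ref}}.
\end{align}
```

In this form, each update maximizes $\mathbb{E}_{a \sim \pi}[Q^{\pi^k}(s,a)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi^{\textnormal{ref}})$ at every state, which is the usual argument for why the regularized value does not decrease from one iterate to the next.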

Figures (7)

  • Figure 1: Overview of how MAVIS is trained and used in inference. (1) Responses to prompts from a pre-selected dataset are generated using the existing policy (either $\pi^{\textnormal{ref}}$ or the previous MAVIS policy) and labeled with each objective’s reward. (2) A separate value model is learned for each objective by regressing on the values derived from the appropriate reward model. (3) The MAVIS decoding process uses these value models together with $\pi^{\textnormal{ref}}$ as shown in Figure 3. (4) Repeatedly training the value models with data generated from the latest MAVIS policy leads to expansion of the achievable Pareto front.
  • Figure 2: (top): MAVIS enables flexible inference-time alignment unlike standard decoding. (bottom): Pareto front comparison for MAVIS and fine-tuning baselines with a Llama-2 7B model as $\pi^{\textnormal{ref}}$.
  • Figure 3: Overview of the MAVIS decoding procedure for a single token. The generative LLM $\pi^{\textnormal{ref}}$ is first queried to get a probability distribution over next tokens, then the tokens with the highest probabilities are selected and evaluated by a set of value models, one for each of the $M$ objectives. The per-objective values are combined according to user-specified weights on the objectives given by $\lambda_1, \ldots, \lambda_M$, and these combined values are used to re-weight the original probabilities of the top tokens, forming a new probability distribution from which the next token is sampled.
  • Figure 4: Pareto front comparison between MAVIS and the fine-tuning baseline algorithms for (a) helpfulness vs harmlessness, (b) helpfulness vs humor, (c) harmlessness vs humor using Llama 13B as the generative model.
  • Figure 5: (a/b) Pareto front comparison between MAVIS and the baseline algorithms for the Summarize from Feedback dataset with Llama-2 7B (a) and Llama-2 13B (b) as the generative model. (c) Pareto front comparison between MAVIS and PARM.
  • ...and 2 more figures
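
The single-token decoding procedure described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `top_k` and `beta` parameters, and the use of plain Python callables in place of learned value models are all assumptions. In particular, the caption only says the combined values "re-weight" the top-token probabilities; the exponential tilting used here is one plausible choice.

```python
import math
import random


def tilted_top_k_distribution(ref_probs, value_fns, weights, top_k=3, beta=1.0):
    """Return (token_ids, new_probs) for one decoding step.

    ref_probs : list of next-token probabilities from the reference model.
    value_fns : list of M callables, token_id -> float (hypothetical stand-ins
                for the per-objective value models).
    weights   : M user-specified objective weights (lambda_1 .. lambda_M).
    beta      : assumed tilting strength; 0 recovers the reference model.
    """
    if len(value_fns) != len(weights):
        raise ValueError("need exactly one weight per value function")

    # 1. Keep the top-k tokens under the reference model.
    top_ids = sorted(range(len(ref_probs)),
                     key=lambda i: ref_probs[i], reverse=True)[:top_k]

    # 2-3. Evaluate every objective on each candidate and combine linearly.
    combined = [sum(w * f(tok) for w, f in zip(weights, value_fns))
                for tok in top_ids]

    # 4. Re-weight the reference probabilities (exponential tilting: assumption).
    logits = [math.log(ref_probs[tok]) + beta * v
              for tok, v in zip(top_ids, combined)]
    shift = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(x - shift) for x in logits]
    total = sum(unnorm)
    return top_ids, [u / total for u in unnorm]


def sample_next_token(ref_probs, value_fns, weights, top_k=3, beta=1.0, rng=None):
    """Sample one token id from the re-weighted top-k distribution."""
    rng = rng or random.Random()
    top_ids, probs = tilted_top_k_distribution(
        ref_probs, value_fns, weights, top_k, beta)
    return rng.choices(top_ids, weights=probs, k=1)[0]


# Toy usage: four-token vocabulary, two made-up objectives.
ref = [0.1, 0.4, 0.3, 0.2]
helpful = lambda tok: 1.0 if tok == 2 else 0.0
harmless = lambda tok: 1.0 if tok == 3 else 0.0
ids, probs = tilted_top_k_distribution(ref, [helpful, harmless], [1.0, 0.0], beta=2.0)
```

Changing `weights` between calls shifts the trade-off without touching the reference model, which is the inference-time flexibility the figure describes. Setting all weights to zero returns the reference probabilities renormalized over the top-k tokens.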

Theorems & Definitions (9)

  • Theorem 1
  • Definition 2
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • proof