MAVIS: Multi-Objective Alignment via Inference-Time Value-Guided Selection
Jeremy Carleton, Debajoy Mukherjee, Srinivas Shakkottai, Dileep Kalathil
TL;DR
MAVIS aligns LLM outputs to multiple, potentially conflicting objectives without fine-tuning the base model. It trains a small value model for each objective and uses these models at inference time to tilt the reference policy's token-level distribution according to a KL-regularized objective. The value models are trained with an iterative algorithm that enables monotonic improvement of the KL-regularized policy. Multiple objectives are handled by linearly combining the per-objective Q-values with user-specified weights, so trade-offs can be adjusted at inference time. Empirically, MAVIS achieves a better Pareto front than several fine-tuning baselines and performs competitively with or better than other inference-time methods across multiple datasets and model sizes. The design emphasizes practical efficiency and flexibility.
Abstract
Large Language Models (LLMs) are increasingly deployed across diverse applications that demand balancing multiple, often conflicting, objectives, such as helpfulness, harmlessness, or humor. Many traditional methods for aligning outputs to user-specific preferences require fine-tuning models for each objective or for specific preference configurations, which is computationally expensive and inflexible. We introduce MAVIS (Multi-Objective Alignment via Inference-Time Value-Guided Selection), a lightweight inference-time alignment framework that enables dynamic control over LLM behavior without modifying the base model's weights. MAVIS trains a set of small value models, each corresponding to a distinct objective. At inference time, these value models are combined using user-specified weights to produce a tilting function that adjusts the base model's output distribution toward desired trade-offs. The value models are trained using a simple iterative algorithm that enables monotonic improvement of the KL-regularized policy. We show empirically that MAVIS achieves a superior Pareto front compared to baselines that fine-tune per-objective models and combine them post hoc or that train a single preference-conditioned value model for guidance. Our code is available at https://github.com/5-Jeremy/MAVIS/tree/main.
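To make the tilting step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the standard closed form for KL-regularized policies, in which the tilted probability of a token is proportional to the reference probability times exp(beta * sum_i w_i * Q_i). The function name `tilted_distribution`, the parameter `beta`, and the toy Q-values are hypothetical and do not come from the MAVIS code base.

```python
import math

def tilted_distribution(ref_logprobs, q_values, weights, beta=1.0):
    """Tilt a reference next-token distribution with weighted per-objective Q-values.

    Assumed form (not taken from the paper's code):
        pi(a) proportional to pi_ref(a) * exp(beta * sum_i w_i * Q_i(a))

    ref_logprobs: log-probabilities of the candidate tokens under the base model
    q_values:     one list of Q-values per objective, aligned with ref_logprobs
    weights:      one user-specified weight per objective
    beta:         hypothetical tilting strength; beta = 0 recovers the base model
    """
    if len(q_values) != len(weights):
        raise ValueError("need exactly one weight per objective")
    scores = []
    for a, logp in enumerate(ref_logprobs):
        # Linear combination of the per-objective Q-values for token a.
        combined_q = sum(w * q[a] for w, q in zip(weights, q_values))
        scores.append(logp + beta * combined_q)
    # Softmax with max-subtraction for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: three candidate tokens and two objectives (values are made up).
ref = [math.log(0.5), math.log(0.3), math.log(0.2)]
q_helpful = [0.0, 1.0, 0.0]   # objective 1 prefers token 1
q_harmless = [0.0, 0.0, 1.0]  # objective 2 prefers token 2

favor_helpful = tilted_distribution(ref, [q_helpful, q_harmless], [1.0, 0.0], beta=2.0)
favor_harmless = tilted_distribution(ref, [q_helpful, q_harmless], [0.0, 1.0], beta=2.0)
```

Changing only the weight vector shifts probability mass toward different tokens while the base model and the value models stay fixed, which is the inference-time flexibility the abstract describes.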
