
Transfer Q Star: Principled Decoding for LLM Alignment

Souradip Chakraborty, Soumya Suvra Ghosal, Ming Yin, Dinesh Manocha, Mengdi Wang, Amrit Singh Bedi, Furong Huang

TL;DR

This paper addresses the challenge of aligning large language models without expensive fine-tuning by developing Transfer Q* (TQ*), a decoding-time method that estimates the optimal value function for a target reward using an existing baseline RLHF-aligned model. It introduces direct and indirect transfer variants to accommodate different baseline alignments and provides rigorous theoretical bounds on suboptimality and KL divergence. Empirically, TQ* demonstrates superior performance over state-of-the-art decoding methods in terms of reward, coherence, and diversity across synthetic and real transfer tasks and model families. The approach offers a scalable, principled pathway for efficient, deployment-time alignment of LLMs with practical guarantees.

Abstract

Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward $r$, thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ($Q^*$), which is often unavailable in practice. Hence, prior SoTA methods either approximate this $Q^*$ using $Q^{\pi_{\texttt{sft}}}$ (derived from the reference $\texttt{SFT}$ model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose Transfer $Q^*$, which implicitly estimates the optimal value function for a target reward $r$ through a baseline model $\rho_{\texttt{BL}}$ aligned with a baseline reward $r_{\texttt{BL}}$ (which can be different from the target reward $r$). Theoretical analyses of Transfer $Q^*$ provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference $\texttt{SFT}$ model based on user needs. Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods and demonstrates superior empirical performance across key metrics such as coherence, diversity, and quality in extensive tests on several synthetic and real datasets.
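
To make the role of the Q-function and the deviation-controlling hyperparameter concrete, here is a minimal sketch of generic KL-regularized decoding, in which a reference next-token distribution is reweighted as $\pi(z) \propto \pi_{\texttt{sft}}(z)\exp(Q(z)/\alpha)$. This closed form is a standard result and is assumed here for illustration; it is not claimed to be the paper's exact objective. The function names, the toy token set, and the Q-values are all hypothetical, and the sketch does not show how Transfer $Q^*$ estimates $Q^*$ from the baseline model.

```python
import math


def reweight_next_token(ref_probs, q_values, alpha):
    """Reweight a reference distribution: pi(z) proportional to ref(z) * exp(q(z) / alpha).

    ref_probs and q_values are dicts keyed by token (hypothetical toy inputs).
    alpha > 0 controls how far the result may move from the reference.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Work in log space and subtract the max for numerical stability.
    logits = {
        tok: math.log(p) + q_values[tok] / alpha
        for tok, p in ref_probs.items()
        if p > 0
    }
    top = max(logits.values())
    weights = {tok: math.exp(v - top) for tok, v in logits.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same tokens."""
    return sum(pi * math.log(pi / q[tok]) for tok, pi in p.items() if pi > 0)


# Toy numbers, made up for illustration only.
ref_probs = {"helpful": 0.2, "neutral": 0.5, "harmful": 0.3}
q_values = {"helpful": 1.0, "neutral": 0.2, "harmful": -1.0}

for alpha in (0.1, 1.0, 10.0):
    policy = reweight_next_token(ref_probs, q_values, alpha)
    print(alpha, policy, kl_divergence(policy, ref_probs))
```

In this toy setting, a small `alpha` concentrates probability on the token with the highest Q-value and yields a large KL divergence from the reference, while a large `alpha` keeps the decoding distribution close to the reference.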

Paper Structure

This paper contains 34 sections, 2 theorems, 52 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

For the proposed Transfer Q$^{\star}$ algorithm (Algorithm algorithm_IQ), the following results hold. (1) The suboptimality gap for all ${\mathbf{x}}$ is upper bounded as [...], where $\beta$ is defined in (dec_new1) for the baseline policy and $\alpha$ is defined in (final_objective) for the decoding process. Here $h_\alpha({\mathbf{x}})\geq 0$, and its formula is given in Appendix supp:sec:main_proof. (2) Assume reward sat

Figures (8)

  • Figure 1: Left. This panel illustrates the concept behind the transfer decoding proposed in this work. The current SoTA method mudgal2024controlled is suboptimal with respect to alignment with the target reward, as indicated by the dotted red arrow. In contrast, the proposed transfer decoding method uses a readily available aligned language model, called the baseline, which is aligned with some baseline reward $r_{\texttt{BL}}$, to bridge the gap between the SoTA method and the target model. Right. This panel provides empirical evidence of the performance gap between the current SoTA decoding strategy mudgal2024controlled and the Oracle (best of $N$ sampling). Our proposed Transfer Q$^{\star}$ (TQ$^{\star}$) reduces this gap and provides a new decoding method.
  • Figure 2: In plots (a), (c), and (d), we present the normalized average reward values obtained using the corresponding setups outlined in Table \ref{tab:setup_indv}. ARGS (SFT) and ARGS (DPO) refer to the reward modeling approach described in khanov2024args applied to the SFT and DPO models, respectively. Across all setups, TQ$^{\star}$ consistently outperforms the other competitive baselines summarized in Table \ref{tab:setup_indv}. We report results on other evaluation setups in Appendix \ref{app:exp_eval_additional}. In (b), we compare (for the Evaluation-1 setup) the trajectory-level KL divergence between different decoding policies and the base model $\rho_{\texttt{sft}}$ to show the effectiveness of the proposed approach compared to the state-of-the-art.
  • Figure 3: Diversity and Coherence analysis of generated responses. We observe that the responses generated using TQ$^{\star}$ obtain the highest coherence and diversity. These results are based on the prompts from the Berkeley Nectar dataset.
  • Figure 4: Evaluation for Synthetic Indirect Transfer Tasks. In (a) and (c), we plot the distribution of the reward values for the source and two transfer tasks on the Ultrafeedback dataset. The reward model architecture is Mistral-7B-$\alpha$ jiang2023mistral. In (b) and (d), we compare the normalized average reward scores for competitive decoding strategies. We denote the variant of our decoding strategy with direct transfer as DT. We observe that TQ$^{\star}$ consistently outperforms the other baselines. Results on other datasets are reported in Appendix \ref{app:indirect_additional_results}.
  • Figure 5: Evaluation for Real Indirect Transfer Tasks. In (a) and (c), we visualize the distribution shift in reward values between the source and target for Setup-1 and Setup-2, respectively, as outlined in Table \ref{tab:real_transfer}. In (b) and (d), we report the normalized average reward scores of different decoding strategies corresponding to Setup-1 and Setup-2, respectively.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2: Restatement of Theorem \ref{main_theorem}
  • Remark 1