DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models

Ruizhe Chen, Wenhao Chai, Zhifei Yang, Xiaotian Zhang, Joey Tianyi Zhou, Tony Quek, Soujanya Poria, Zuozhu Liu

TL;DR

DiffPO addresses the scalability and latency limitations of traditional RLHF by reframing alignment as a sentence-level diffusion-like denoising process that operates during inference. It introduces a plug-and-play, model-agnostic module that uses parallel decoding and consistency objectives to transform unaligned sentence generations into aligned outputs without full retraining. Empirical results on MT-bench, AlpacaEval 2, and HH-RLHF show DiffPO improves alignment quality while maintaining favorable inference-time efficiency, and scales effectively to larger base models. The approach offers a practical pathway to robust, human-aligned behavior across diverse LLMs with minimal retraining requirements.

Abstract

Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However, these approaches still face challenges, such as limited scalability due to policy-specific value functions and latency during the inference phase. In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic solution for aligning LLMs with humans. By directly performing alignment at sentence level, DiffPO avoids the time latency associated with token-level generation. Designed as a plug-and-play module, DiffPO can be seamlessly integrated with various base models to enhance their alignment. Extensive experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings, achieving a favorable trade-off between alignment quality and inference-time latency. Furthermore, DiffPO demonstrates model-agnostic scalability, significantly improving the performance of large models such as Llama-3-70B.

Paper Structure

This paper contains 48 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison with Inference-Time Methods. Points closer to the top-right indicate a superior trade-off between performance and inference time.
  • Figure 2: Illustration of the DiffPO Framework. (a) The objective of LLM alignment is to adjust the output of LLMs to reflect human values and intentions. In this process, preferences are considered at the sentence level, focusing on aspects such as the style and format of the complete output. (b) We propose Diffusion-styled Preference Optimization (DiffPO), which reconceptualizes alignment as a sentence-level denoising process, where the goal is to transform an unaligned sentence $\mathbf{y}^{(0)}$ into an aligned sentence $\mathbf{y}^{(T)}$ step by step. (c) Designed as a plug-and-play module, DiffPO can be applied directly to the base model's output to yield better alignment.
  • Figure 3: Comparison of Inference-Time Efficiency. We compare DiffPO with existing inference-time alignment techniques, evaluating both alignment performance and execution time. Points located closer to the top-right corner indicate a better trade-off. When both aspects are considered, DiffPO shows a superior performance-efficiency trade-off on all three datasets.
  • Figure 4: Illustration of the Speedup of DiffPO.
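
The step-by-step refinement described in the Figure 2 caption, where an unaligned sentence $\mathbf{y}^{(0)}$ is moved toward an aligned sentence $\mathbf{y}^{(T)}$ by a module applied on top of the base model's output, can be pictured with a small toy loop. This is only an illustrative sketch, not the paper's trained module or its parallel decoding scheme: the names `refine`, `toy_align_step`, and `TOY_REWRITES` are hypothetical, and the "alignment step" here is a plain string rewrite standing in for a learned model.

```python
from typing import Callable, List, Tuple

# Hypothetical rewrite rules standing in for a learned alignment module.
TOY_REWRITES: List[Tuple[str, str]] = [
    ("dumb", "unclear"),
    ("shut up", "please wait"),
]


def toy_align_step(sentence: str) -> str:
    """Apply at most one rewrite per call, so refinement takes several steps."""
    for bad, good in TOY_REWRITES:
        if bad in sentence:
            return sentence.replace(bad, good, 1)
    return sentence  # nothing left to change


def refine(y0: str, step: Callable[[str], str], max_steps: int = 8) -> List[str]:
    """Return the trajectory y^(0), y^(1), ... produced by repeatedly applying `step`.

    Stops early once a step leaves the sentence unchanged (a fixed point),
    or after `max_steps` applications, whichever comes first.
    """
    trajectory = [y0]
    for _ in range(max_steps):
        nxt = step(trajectory[-1])
        if nxt == trajectory[-1]:
            break
        trajectory.append(nxt)
    return trajectory


if __name__ == "__main__":
    # The base model's raw output plays the role of y^(0).
    for t, y in enumerate(refine("That dumb question is dumb, shut up.", toy_align_step)):
        print(t, y)
```

Because `refine` only needs a callable from sentence to sentence, the same loop works with any base model's output, which is the sense in which such a module is plug-and-play in this sketch.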