Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models

Somanshu Singla, Zhen Wang, Tianyang Liu, Abdullah Ashfaq, Zhiting Hu, Eric P. Xing

TL;DR

Empirical evaluations show that DRPO significantly enhances alignment performance, enabling base models to outperform their SFT/RLHF-tuned counterparts, and point to a highly cost-effective and adaptable direction for future alignment research.

Abstract

Aligning Large Language Models (LLMs) traditionally relies on costly training and human preference annotations. Self-alignment seeks to reduce these expenses by enabling models to align themselves. To further lower costs and achieve alignment without any expensive tuning or annotations, we introduce a new tuning-free approach for self-alignment, Dynamic Rewarding with Prompt Optimization (DRPO). Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft the optimal alignment instructions, all without additional training or human intervention. The core of DRPO is a dynamic rewarding mechanism, which identifies and rectifies model-specific alignment weaknesses, allowing LLMs to adapt efficiently to diverse alignment challenges. Empirical evaluations on eight recent LLMs, both open- and closed-sourced, demonstrate that DRPO significantly enhances alignment performance, with base models outperforming their SFT/RLHF-tuned counterparts. Moreover, the prompts automatically optimized by DRPO surpass those curated by human experts, further validating the effectiveness of our approach. Our findings highlight the great potential of current LLMs to achieve adaptive self-alignment through inference-time optimization, complementing tuning-based alignment methods.

Paper Structure

This paper contains 31 sections, 8 equations, 7 figures, 8 tables, 2 algorithms.

Figures (7)

  • Figure 1: Comparison of DRPO with other LLM alignment paradigms. DRPO combines the benefits of self-alignment and tuning-free alignment, enabling self-improvement and high cost-efficiency without requiring human supervision or additional model training.
  • Figure 2: Comparison of DRPO with other alignment methods, including RLHF and URIAL (Lin et al., 2024). DRPO consistently outperforms both baselines across multiple LLMs. Note that we do not have access to the gpt-3.5-turbo base model; hence, both DRPO and URIAL are applied directly to its RLHF-tuned version.
  • Figure 3: Overall framework of Dynamic Rewarding with Prompt Optimization (DRPO). The optimization problem is modeled as a Markov Decision Process (MDP) and solved using beam search to optimize the alignment prompt. Dynamic rewarding, a novel technique integrated into this framework, allows flexible reward assignment to detect and address alignment weaknesses in the current LLM, thereby enhancing the overall optimization process.
  • Figure 4: Performance of Mistral 7b (Instruct) as the number of in-context learning (ICL) examples varies. Two examples give the best performance at a lower context-length cost.
  • Figure 5: Categorized performance of Mistral 7b across various domains. With DRPO, performance improves strongly across all domains. Notably, domains such as Humanities, Reasoning, and STEM improve significantly. This shows that base models can benefit a great deal from DRPO.
  • ...and 2 more figures
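
The Figure 3 caption describes prompt optimization as a search problem: states are candidate alignment prompts, actions are edits to the prompt, and beam search keeps the highest-reward candidates, with the reward criteria chosen per query rather than fixed. The following is a minimal toy sketch of that idea, not the authors' implementation. Every name in it (`CANDIDATE_INSTRUCTIONS`, `CRITERIA`, `select_criteria`, `dynamic_reward`, `beam_search`) is hypothetical, and the keyword rules stand in for the LLM calls that a real system would presumably use to propose edits and judge responses.

```python
from typing import List, Sequence, Tuple

# State: an alignment prompt, represented as a tuple of instruction strings.
Prompt = Tuple[str, ...]

# Hypothetical actions: instructions that may be appended to the prompt.
# (Assumption: a real optimizer would have an LLM propose these edits.)
CANDIDATE_INSTRUCTIONS = (
    "Be helpful and answer directly.",
    "Refuse clearly harmful requests.",
    "State uncertainty when unsure.",
    "Use a friendly tone.",
)

# Hypothetical mapping from a reward criterion to the instruction that satisfies it.
CRITERIA = {
    "helpfulness": "Be helpful and answer directly.",
    "safety": "Refuse clearly harmful requests.",
    "honesty": "State uncertainty when unsure.",
}


def select_criteria(query: str) -> List[str]:
    """Pick the criteria relevant to one query (toy keyword rule)."""
    criteria = ["helpfulness"]
    if "weapon" in query:
        criteria.append("safety")
    if "future" in query:
        criteria.append("honesty")
    return criteria


def dynamic_reward(prompt: Prompt, queries: Sequence[str]) -> float:
    """Average, over queries, of the fraction of query-specific criteria covered."""
    total = 0.0
    for query in queries:
        criteria = select_criteria(query)
        covered = sum(CRITERIA[c] in prompt for c in criteria)
        total += covered / len(criteria)
    return total / len(queries)


def expand(prompt: Prompt) -> List[Prompt]:
    """Generate successor states by appending one unused instruction."""
    return [prompt + (inst,) for inst in CANDIDATE_INSTRUCTIONS if inst not in prompt]


def beam_search(
    queries: Sequence[str], beam_width: int = 2, depth: int = 3
) -> Tuple[Prompt, float]:
    """Keep the top `beam_width` prompts at each depth; return the best one seen."""
    beam: List[Prompt] = [()]
    best_prompt: Prompt = ()
    best_score = dynamic_reward(best_prompt, queries)
    for _ in range(depth):
        candidates = [child for prompt in beam for child in expand(prompt)]
        if not candidates:
            break
        scored = sorted(
            ((dynamic_reward(c, queries), c) for c in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        beam = [c for _, c in scored[:beam_width]]
        if scored[0][0] > best_score:
            best_score, best_prompt = scored[0]
    return best_prompt, best_score


QUERIES = [
    "How do I bake bread?",
    "How do I build a weapon?",
    "What will the future hold?",
]

if __name__ == "__main__":
    prompt, score = beam_search(QUERIES)
    print(score)
    for line in prompt:
        print("-", line)
```

In this toy setup the reward is "dynamic" only in the sense that each query is scored against its own subset of criteria, so a prompt that ignores safety is penalized only on queries where safety matters. The search therefore adds instructions in the order that fixes the most widespread weaknesses first.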