Aligning Large Language Models via Fully Self-Synthetic Data

Shangjian Yin, Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Yu Meng

TL;DR

SAO introduces a fully self-synthetic pipeline for aligning LLMs: the model itself generates the prompts, responses, and preferences through persona role-play, pairwise response sampling, and self-judgment. Preference optimization uses SimPO, whose objective applies length normalization and a target margin to the self-generated preference pairs. On AlpacaEval 2.0, MT-Bench, Arena-Hard, and the Open LLM Leaderboard, SAO delivers substantial alignment gains without external labeled data while maintaining downstream task performance, and the gains are robust across multiple judges and under iterative optimization. The results point to scalable self-improvement as a viable way to improve chat abilities while reducing reliance on costly data collection and annotation.
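To make the SimPO objective mentioned above concrete, here is a minimal sketch of the per-pair loss in its standard form, $-\log \sigma\big(\tfrac{\beta}{|y_w|}\log \pi(y_w \mid x) - \tfrac{\beta}{|y_l|}\log \pi(y_l \mid x) - \gamma\big)$. The function name `simpo_loss` and the default values of `beta` and `gamma` are illustrative assumptions, not settings taken from the paper.

```python
import math


def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss for one preference pair.

    logp_* are summed token log-probabilities of each response under the
    policy; len_* are the response lengths in tokens. The beta and gamma
    defaults are placeholders (assumption), not the paper's hyperparameters.
    """
    # Length-normalized implicit rewards: average log-probability scaled by beta.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected

    # The chosen response must beat the rejected one by at least gamma.
    z = reward_chosen - reward_rejected - gamma

    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

Because each reward is divided by the response length, a longer response does not gain an advantage merely by accumulating more token log-probabilities, and no reference model is needed to compute the loss.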

Abstract

Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets, while Reinforcement Learning from AI Feedback (RLAIF) also incurs significant costs, requiring the collection of diverse prompts and corresponding responses, often necessitating external reward models or proprietary models like GPT-4 to annotate preference pairs. In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment, where all training data, including prompts (i.e., user queries), responses, and preferences, are generated by the model itself. Specifically, SAO first instructs the LLM to engage in persona role-play and generate diverse prompts and responses, which are then self-evaluated for preference optimization. Extensive experiments demonstrate that SAO effectively enhances the model's chat capabilities on standard benchmarks like AlpacaEval 2.0, while maintaining strong performance on downstream objective tasks (e.g., question-answering, math reasoning). Our work provides a practical solution for self-improvement in aligning LLMs, and the code for reproducing our results is available at: https://github.com/SJY8460/SAO.
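The data-synthesis loop described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the released implementation: `llm` is a hypothetical callable that maps a text prompt to a sampled completion, `build_preference_pairs` is a made-up name, and the instruction strings are placeholders rather than the prompts used in the paper.

```python
from typing import Callable, Dict, List


def build_preference_pairs(llm: Callable[[str], str],
                           personas: List[str]) -> List[Dict[str, str]]:
    """Build (prompt, chosen, rejected) records using a single model.

    `llm` is a hypothetical text-in, text-out sampler (assumption); it should
    sample stochastically so the two responses to a prompt can differ.
    """
    pairs = []
    for persona in personas:
        # Step 1: persona role-play produces a user query.
        prompt = llm(
            "Role-play as the following persona and write one user query "
            f"they might ask.\nPersona: {persona}"
        )

        # Step 2: sample two candidate responses to that query.
        response_a = llm(prompt)
        response_b = llm(prompt)

        # Step 3: the same model judges which response is better.
        verdict = llm(
            f"Query: {prompt}\n\nResponse A: {response_a}\n\n"
            f"Response B: {response_b}\n\n"
            "Which response is better? Answer with A or B."
        )
        a_wins = verdict.strip().upper().startswith("A")
        chosen, rejected = (
            (response_a, response_b) if a_wins else (response_b, response_a)
        )
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting records have the prompt/chosen/rejected shape that pairwise preference optimization methods typically consume. Any verdict that does not start with "A" is treated as a vote for B here, which is a simplification; a more careful version would discard unparseable verdicts.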

Paper Structure

This paper contains 34 sections, 7 equations, 4 figures, 13 tables, 1 algorithm.

Figures (4)

  • Figure 1: Impact of dataset size on model performance.
  • Figure 2: Impact of iterative optimization.
  • Figure 3: Distribution of prompt and response lengths.
  • Figure 4: The top box displays the persona instruction prompt, which directs the LLM to generate a specific prompt based on a given persona. The bottom box illustrates the pairwise response ranking prompt, which instructs the LLM to compare and rank responses based on specific criteria modified from Shen et al. (2024).