Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis

Leitian Tao, Xuefeng Du, Sharon Li

TL;DR

This work targets the data bottleneck in reward modeling for aligning large language models by introducing LENS, a latent-space data-synthesis framework that operates directly on response embeddings. A divergence-aware variational autoencoder learns a structured embedding space, enabling controlled latent perturbations and decoding back to embeddings to create diverse, semantically consistent synthetic preferences; this approach yields theoretical guarantees on preserving preference order and improving generalization. Empirically, latent-space synthesis outperforms text-based augmentation on HH-RLHF and TL;DR benchmarks, offering up to 18x faster generation and a 16,000x smaller generator, while enabling effective rejection-sampling-based SFT. The method significantly reduces computational costs and demonstrates strong generalization across model families, providing a practical, scalable path to better reward modeling for AI alignment.

Abstract

Reward modeling, crucial for aligning large language models (LLMs) with human preferences, is often bottlenecked by the high cost of preference data. Existing textual data synthesis methods are computationally expensive. We propose a novel framework LENS for synthesizing preference data directly in the LLM's latent embedding space. Our method employs a Variational Autoencoder (VAE) to learn a structured latent representation of response embeddings. By performing controlled perturbations in this latent space and decoding back to the embedding space, we efficiently generate diverse, semantically consistent synthetic preference pairs, bypassing costly text generation and annotation. We provide theoretical guarantees that our synthesized pairs approximately preserve original preference ordering and improve reward model generalization. Empirically, our latent-space synthesis significantly outperforms text-based augmentation on standard benchmarks, achieving superior results while being 18x faster in generation and using a 16,000x smaller model. Our work offers a scalable and effective alternative for enhancing reward modeling through efficient data augmentation. Code is publicly available at https://github.com/deeplearning-wisc/lens
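The perturb-and-decode step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `synthesize_pairs`, the noise scale `sigma`, and the linear `encode`/`decode` maps are hypothetical stand-ins for a trained VAE encoder mean and decoder, and the embeddings are random toy vectors. It assumes the synthetic chosen and rejected embeddings inherit the label of the original pair they were perturbed from.

```python
import numpy as np

def synthesize_pairs(e_pos, e_neg, encode, decode, sigma=0.1, n_samples=4, seed=0):
    """Return synthetic (chosen, rejected) embedding pairs via latent perturbation."""
    rng = np.random.default_rng(seed)
    # Map both response embeddings into the latent space.
    z_pos, z_neg = encode(e_pos), encode(e_neg)
    pairs = []
    for _ in range(n_samples):
        # Add isotropic Gaussian noise in latent space, then decode back
        # to the embedding space; no text is generated at any point.
        z_pos_new = z_pos + sigma * rng.standard_normal(z_pos.shape)
        z_neg_new = z_neg + sigma * rng.standard_normal(z_neg.shape)
        pairs.append((decode(z_pos_new), decode(z_neg_new)))
    return pairs

# Toy stand-ins (assumptions): a random linear map plays the role of the encoder,
# and its transpose plays the role of the decoder.
rng = np.random.default_rng(42)
W = rng.standard_normal((8, 32)) / np.sqrt(32)  # 32-dim embedding -> 8-dim latent
encode = lambda e: W @ e
decode = lambda z: W.T @ z

e_pos = rng.standard_normal(32)  # toy "chosen" response embedding
e_neg = rng.standard_normal(32)  # toy "rejected" response embedding
pairs = synthesize_pairs(e_pos, e_neg, encode, decode, sigma=0.05, n_samples=4)
```

With `sigma=0.0` the sketch reduces to plain reconstruction, `decode(encode(e))`, which makes it easy to check how much of any change comes from the noise rather than from the encoder and decoder themselves.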

Paper Structure

This paper contains 49 sections, 5 theorems, 40 equations, 5 figures, 6 tables.

Key Result

Theorem 1

(Informal). Under mild conditions, for any preference LLM embedding ${\mathbf{e}}\sim\mathcal{E}$, sample a latent vector $\mathbf{z}\sim {q}_\phi(\cdot|\mathbf{e})$. If there exists a constant $\epsilon_{\mathrm{rec}}$ that satisfies $\|g_{\theta}({q}_\phi(\mathbf{z}|\mathbf{e})) - \mathbf{e}\| \le \epsilon_{\mathrm{rec}}$, [...] where $d_{\mathrm{VAE}}$ is the dimension of the VAE latent space, and $({\mathbf{e}}^+, {\mathbf{e}}$ [...]

Figures (5)

  • Figure 1: Comparison of textual space synthesis (top) and latent space synthesis (bottom). Latent space synthesis operates on embeddings, offering significant computational advantages. Best viewed in color.
  • Figure 2: (a) Effect of weight of $\mathcal{L}_\text{divergence}$. (b) Effect of noise variance $\sigma^2$ during synthesis. (c) Ablation on the number of initial training samples (in thousands).
  • Figure 3: The t-SNE visualization of VAE latent space with different levels of divergence regularization, controlled through the loss weight $\gamma$.
  • Figure 4: (a) Ablation on the KL divergence weight $\beta$. (b) Performance using original, synthetic, or combined training data.
  • Figure 5: Log-log plot of VAE reconstruction error ($\epsilon_{\mathrm{rec}}$) against the training dataset size ($N$). The linear trend supports the power-law decay $\epsilon_{\mathrm{rec}} = \mathcal{O}(N^{-p})$, with an estimated $p \approx 0.26$.
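The power-law reading of Figure 5 amounts to a straight-line fit in log-log space: if $\epsilon_{\mathrm{rec}} = c\,N^{-p}$, then $\log \epsilon_{\mathrm{rec}} = \log c - p \log N$, so the slope of the fitted line is $-p$. The sketch below shows that fit on synthetic values generated for illustration only; the dataset sizes, the constant `3.0`, and the helper name `fit_power_law` are assumptions, and none of the numbers are the paper's measurements.

```python
import numpy as np

def fit_power_law(n, eps):
    """Fit eps ~ c * n**(-p) by least squares in log-log space; return (p, c)."""
    slope, intercept = np.polyfit(np.log(n), np.log(eps), 1)
    return -slope, np.exp(intercept)

# Synthetic, noise-free data following an exact power law (illustrative only).
n = np.array([1e3, 2e3, 5e3, 1e4, 2e4, 5e4])
eps = 3.0 * n ** -0.26

p_hat, c_hat = fit_power_law(n, eps)
print(f"estimated p = {p_hat:.3f}, c = {c_hat:.3f}")
```

On real measurements the points would scatter around the line, so the estimate of $p$ would carry some uncertainty that this noise-free example does not show.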

Theorems & Definitions (12)

  • Definition 1: Preference Data.
  • Theorem 1
  • Theorem 2
  • Definition 1: $L_g$-Lipschitz
  • Definition 2: $\alpha$-Hölder continuous
  • Remark 1
  • Theorem 1: Formal
  • Theorem 2: Formal
  • Proof of Theorem 1 (Formal)
  • Proof of Theorem 2 (Formal)
  • ...and 2 more