Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis
Leitian Tao, Xuefeng Du, Sharon Li
TL;DR
This work targets the data bottleneck in reward modeling for aligning large language models by introducing LENS, a latent-space data-synthesis framework that operates directly on response embeddings. A divergence-aware variational autoencoder learns a structured latent representation of these embeddings; controlled perturbations in that latent space are decoded back to embeddings to create diverse, semantically consistent synthetic preference pairs. The approach comes with theoretical guarantees that synthesized pairs approximately preserve the original preference order and improve reward model generalization. Empirically, latent-space synthesis outperforms text-based augmentation on the HH-RLHF and TL;DR benchmarks, with up to 18x faster generation and a 16,000x smaller generator, and it enables effective rejection-sampling-based SFT. The method substantially reduces computational cost and generalizes well across model families, providing a practical, scalable path to better reward modeling for AI alignment.
Abstract
Reward modeling, crucial for aligning large language models (LLMs) with human preferences, is often bottlenecked by the high cost of preference data. Existing textual data synthesis methods are computationally expensive. We propose a novel framework, LENS, for synthesizing preference data directly in the LLM's latent embedding space. Our method employs a Variational Autoencoder (VAE) to learn a structured latent representation of response embeddings. By performing controlled perturbations in this latent space and decoding back to the embedding space, we efficiently generate diverse, semantically consistent synthetic preference pairs, bypassing costly text generation and annotation. We provide theoretical guarantees that our synthesized pairs approximately preserve the original preference ordering and improve reward model generalization. Empirically, our latent-space synthesis significantly outperforms text-based augmentation on standard benchmarks, achieving superior results while being 18x faster in generation and using a 16,000x smaller model. Our work offers a scalable and effective alternative for enhancing reward modeling through efficient data augmentation. Code is publicly available at https://github.com/deeplearning-wisc/lens
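The encode, perturb, decode loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The encoder and decoder below are hypothetical fixed linear maps standing in for a trained VAE, and the function names, the Gaussian noise scale `sigma`, and the toy dimensions are all assumptions chosen for illustration. The one idea taken from the abstract is that each synthetic pair inherits the preference label of the original pair.

```python
import random


def encode(embedding, enc_weights):
    """Project an embedding to a latent vector (stand-in for a VAE encoder mean)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in enc_weights]


def decode(latent, dec_weights):
    """Map a latent vector back to embedding space (stand-in for a VAE decoder)."""
    return [sum(w * z for w, z in zip(row, latent)) for row in dec_weights]


def perturb(latent, sigma, rng):
    """Add isotropic Gaussian noise with standard deviation sigma to a latent vector."""
    return [z + rng.gauss(0.0, sigma) for z in latent]


def synthesize_pairs(chosen_emb, rejected_emb, enc_weights, dec_weights,
                     sigma=0.05, n_samples=4, seed=0):
    """Create synthetic (chosen, rejected) embedding pairs from one original pair.

    Each synthetic pair keeps the original ordering: the first element is
    treated as preferred. Small sigma keeps samples close to the originals.
    """
    rng = random.Random(seed)
    z_chosen = encode(chosen_emb, enc_weights)
    z_rejected = encode(rejected_emb, enc_weights)
    pairs = []
    for _ in range(n_samples):
        new_chosen = decode(perturb(z_chosen, sigma, rng), dec_weights)
        new_rejected = decode(perturb(z_rejected, sigma, rng), dec_weights)
        pairs.append((new_chosen, new_rejected))
    return pairs


if __name__ == "__main__":
    # Toy setup: 4-dimensional embeddings, 2-dimensional latent space.
    enc_w = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0]]
    dec_w = [[1.0, 0.0],
             [0.0, 1.0],
             [0.0, 0.0],
             [0.0, 0.0]]
    chosen = [1.0, 2.0, 3.0, 4.0]
    rejected = [0.5, 0.1, 0.0, 0.0]
    for c, r in synthesize_pairs(chosen, rejected, enc_w, dec_w):
        print("chosen:", c, "rejected:", r)
```

In a real pipeline the synthetic embedding pairs would be appended to the reward model's training set. No text is generated or annotated at any point, which is where the efficiency gain reported in the abstract comes from.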
