Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation
Javad Seraj, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi
TL;DR
The paper tackles the challenge of personalizing automatic evaluation with limited data by proposing a data-efficient augmentation pipeline that aligns open LLMs with a human reference judge. It introduces three augmentation strategies (Naïve Data Creation, Pool of Feedback, and Efficient Sampling) to generate and select preference data, then applies Direct Preference Optimization (DPO) to a LoRA-tuned Llama-3.1-8B-Instruct to align the evaluator with a GPT-4 reference on math reasoning and truthful QA tasks. Results show that data augmentation, especially the Pool of Feedback and Efficient Sampling methods, improves alignment with the reference judge (e.g., up to ~0.63 Pearson correlation on BigGen-Bench, vs. ~0.54 with naïve data) and can surpass the base model on key tasks. The work demonstrates that effective data selection and reasoning-augmented augmentation can enable open LLMs to achieve competitive personalized evaluation with scarce data, offering practical implications for scalable, customized evaluation pipelines.
Abstract
Automatic evaluation by large language models (LLMs) is a prominent topic today; however, judgment and evaluation tasks are often subjective and influenced by various factors, making adaptation challenging. While many studies demonstrate the capabilities of state-of-the-art proprietary LLMs in comparison to human evaluators, these models often struggle to adapt to reference evaluators over time, a requirement for achieving personalized judgment. Additionally, numerous works have attempted to apply open LLMs as judges or evaluators, but these efforts frequently overlook the limitations of working with scarce data. Personalized judgment is inherently associated with limited-data scenarios, which are common in many real-world problems. Our work presents a data augmentation technique that selects more effective samples from limited data in order to align an open LLM with human preference. Our approach achieves an approximately 7% improvement in Pearson correlation with a reference judge over the baseline, and a 30% improvement over the base model (Llama-3.1-8B-Instruct) in the mathematical reasoning evaluation task, demonstrating that augmenting and selecting more effective preference data enables our approach to surpass baseline methods.
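Alignment with the reference judge is reported as a Pearson correlation between the evaluator's scores and the reference judge's scores on the same responses. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name `pearson` and the two score lists are hypothetical values chosen for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 rubric scores for five responses (illustration only).
reference_scores = [5, 3, 4, 1, 2]   # scores from the reference judge
evaluator_scores = [4, 3, 5, 1, 2]   # scores from the aligned open LLM

r = pearson(reference_scores, evaluator_scores)
print(f"Pearson r = {r:.2f}")  # prints "Pearson r = 0.90"
```

A value closer to 1 means the evaluator ranks and scales responses more like the reference judge; comparing this value before and after alignment is one way to read the improvements quoted above.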
