Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation

Javad Seraj, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi

TL;DR

The paper tackles the challenge of personalizing automatic evaluation under limited data by proposing a data-augmentation pipeline that aligns open LLMs with a reference judge. It introduces three augmentation strategies (Naïve Data Creation, Pool of Feedback, and Efficient Sampling) to generate and select preference data, then applies Direct Preference Optimization (DPO) to a LoRA-tuned Llama-3.1-8B-Instruct so that the evaluator aligns with a GPT-4 reference on math reasoning and truthful QA tasks. Results show that data augmentation, especially the Pool of Feedback and Efficient Sampling methods, improves alignment with the reference judge (e.g., up to ~0.63 Pearson on BigGen-Bench, vs ~0.54 with naïve data) and can surpass the base model in key tasks. The work demonstrates that effective data selection and reasoning-augmented augmentation can enable open LLMs to achieve competitive personalized evaluation with scarce data, with practical implications for scalable, customized evaluation pipelines.
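As a rough illustration of the DPO step mentioned above, the sketch below computes the standard DPO loss for a single preference pair. It is not code from the paper: the function names, argument names, and the `beta=0.1` default are illustrative assumptions, and the inputs are assumed to be summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") feedback under the tuned policy and the frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are log-probabilities of a full response; beta is an
    assumed illustrative value, not one reported in the source.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Hypothetical log-probabilities, chosen only to exercise the function.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls below that value as the policy shifts probability toward the chosen response relative to the reference.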

Abstract

Automatic evaluation by large language models (LLMs) is a prominent topic today; however, judgment and evaluation tasks are often subjective and influenced by various factors, making adaptation challenging. While many studies demonstrate the capabilities of state-of-the-art proprietary LLMs in comparison to human evaluators, these models often struggle to adapt to reference evaluators over time, a requirement for achieving personalized judgment. Additionally, numerous works have attempted to apply open LLMs as judges or evaluators, but these efforts frequently overlook the limitations of working with scarce data. Personalized judgment is inherently associated with limited data scenarios, which are common in many real-world problems. Our work presents a data augmentation technique that selects more effective samples from limited data in order to align an open LLM with human preference. Our work achieves an improvement of approximately 7% in Pearson correlation with a reference judge over the baseline, and a 30% improvement over the base model (Llama3.1-8B-Instruct) in the mathematical reasoning evaluation task, demonstrating that augmenting and selecting more effective preference data enables our approach to surpass baseline methods.
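The alignment metric named above, Pearson correlation between an evaluator's scores and a reference judge's scores, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code; the function name and the example score lists are made up for demonstration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical scores on a 1-5 rubric for the same five responses.
    evaluator_scores = [1, 2, 3, 4, 5]
    reference_scores = [1, 3, 3, 5, 5]
    print(pearson(evaluator_scores, reference_scores))
```

A value near 1 means the evaluator ranks and spaces responses much like the reference judge does; the coefficient is insensitive to a constant offset or positive rescaling of either score list.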

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Different approaches for preference data creation: (a) the naïve data creation approach, (b) the pool of feedback approach, and (c) the efficient sampling approach.
  • Figure 2: Score distributions for different LLMs on the (a) BigGen-Bench and (b) TruthfulQA datasets. Note that the distribution of scores on BigGen-Bench is discrete, whereas the scores on TruthfulQA are continuous.
  • Figure 3: Selected data from the BigGen-Bench benchmark, where generated feedback and respective scores are discrete. (a) The distribution of the entire dataset's embeddings and (b) the distribution of samples selected using the efficient sampling approach. Note that the two sets differ in size because the selection balances data across scores.
  • Figure 4: Selected data from the TruthfulQA benchmark, where generated feedback and respective scores are continuous. (a) The distribution of the entire dataset's embeddings and (b) the distribution of samples selected using the efficient sampling approach. Note that the two sets differ in size because the selection balances data across scores.