The Fellowship of the LLMs: Multi-Model Workflows for Synthetic Preference Optimization Dataset Generation

Samee Arif, Sualeha Farid, Abdul Hameed Azeemi, Awais Athar, Agha Ali Raza

TL;DR

The paper presents a multi-model workflow for synthetic Preference Optimization dataset generation that separates response evaluation from response generation. It systematically compares LLM-based evaluators (Judge, Jury, Debate) and a multi-model LLM Feedback Loop for generation, identifying GPT-4o as a reliable judge and the Llama-Gemma pairing as particularly effective in generation. The study evaluates across multiple benchmarks, reports win rates of 71.8% and 73.8% for the Llama-Gemma feedback loop over single-model Llama and Gemma, respectively, and releases prompts and datasets to support reproducibility. Despite promising results, it acknowledges substantial computational costs and potential biases, and it suggests careful deployment and future work with larger models and more iterations.

Abstract

This paper presents a novel methodology for generating synthetic Preference Optimization (PO) datasets using multi-model workflows. We evaluate the effectiveness and potential of these workflows in automating and enhancing the dataset generation process. PO dataset generation requires two modules: (1) response evaluation, and (2) response generation. In the response evaluation module, the responses from Large Language Models (LLMs) are evaluated and ranked, a task typically carried out by human annotators that we automate using LLMs. We assess the response evaluation module in a two-step process. In step 1, we assess LLMs as evaluators using three distinct prompting strategies. In step 2, we apply the winning prompting strategy to compare the performance of LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. Our evaluation shows that GPT-4o-as-a-Judge is more consistent across all datasets. For the response generation module, we use the identified LLM evaluator configuration and compare different configurations of the LLM Feedback Loop. We use the win rate to determine the best multi-model configuration for generation. Experimenting with various configurations, we find that the LLM Feedback Loop, with Llama as the generator and Gemma as the reviewer, achieves win rates of 71.8% and 73.8% over single-model Llama and Gemma, respectively. After identifying the best configurations for both modules, we generate our PO datasets using the above pipeline.
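
The generator-reviewer loop and the win-rate comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`feedback_loop`, `win_rate`), the callable signatures, the number of iterations, and the choice to count ties as non-wins are all assumptions made for illustration. The models are passed in as plain Python callables so that the sketch runs without any LLM API.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not taken from the paper):
Generator = Callable[[str, str], str]    # (prompt, feedback) -> response
Reviewer = Callable[[str, str], str]     # (prompt, response) -> feedback
Judge = Callable[[str, str, str], str]   # (prompt, resp_a, resp_b) -> "A", "B" or "tie"


def feedback_loop(prompt: str, generate: Generator, review: Reviewer,
                  iterations: int = 1) -> str:
    """Generate a response, then refine it with reviewer feedback."""
    # First draft is produced without any feedback.
    response = generate(prompt, "")
    for _ in range(iterations):
        # The reviewer model critiques the current draft ...
        feedback = review(prompt, response)
        # ... and the generator model revises it using that critique.
        response = generate(prompt, feedback)
    return response


def win_rate(prompts: Sequence[str],
             candidate: Callable[[str], str],
             baseline: Callable[[str], str],
             judge: Judge) -> float:
    """Fraction of prompts on which the judge prefers the candidate.

    Assumption: a "tie" verdict counts as a non-win for the candidate.
    """
    if not prompts:
        raise ValueError("prompts must not be empty")
    wins = 0
    for prompt in prompts:
        # The candidate response is always shown to the judge as "A".
        verdict = judge(prompt, candidate(prompt), baseline(prompt))
        if verdict == "A":
            wins += 1
    return wins / len(prompts)
```

In this sketch, a multi-model configuration would be passed as `candidate` (for example, a closure over `feedback_loop` with one model as generator and another as reviewer), a single model as `baseline`, and the selected evaluator as `judge`. A real judge would also need to control for position bias, for instance by evaluating both response orderings, which the sketch omits.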

Paper Structure (26 sections, 2 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: LLM Debate for evaluation
  • Figure 2: LLM Feedback Loop for response generation
  • Figure 3: Comparison of LLM Debate and LLM-as-a-Judge across the three datasets and different models.
  • Figure 4: Multi-model framework for PO dataset generation.