Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets

Peter Devine

TL;DR

This work tackles inconsistency in the AI evaluator rankings used to build preference datasets for Reinforcement Learning from AI Feedback (RLAIF). It introduces Repeat Ranking, which retains only responses that are ranked consistently across multiple GPT-4 evaluations, with consistency quantified by Kendall's $W$, and applies this to create the Mitsu multilingual dataset. Training with ORPO on selected, high-consensus subsets (e.g., Suzume-ORPO-50) yields stronger MT-Bench performance across six languages than training on all data, while requiring less data. The findings support a quality-over-quantity principle for RLAIF data and offer a stackable, cost-efficient path to improved multilingual LLMs.

Abstract

Training Large Language Models (LLMs) with Reinforcement Learning from AI Feedback (RLAIF) aligns model outputs more closely with human preferences. This involves an evaluator model ranking multiple candidate responses to user prompts. However, the rankings from popular evaluator models such as GPT-4 can be inconsistent. We propose the Repeat Ranking method, in which we evaluate the same responses multiple times and train only on those responses that are consistently ranked. Using 2,714 prompts in 62 languages, we generated responses from 7 top multilingual LLMs and had GPT-4 rank them five times each. Evaluating on MT-Bench chat benchmarks in six languages, our method outperformed the standard practice of training on all available prompts. Our work highlights the quality versus quantity trade-off in RLAIF dataset generation and offers a stackable strategy for enhancing dataset and thus model quality.
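As a rough illustration of the consistency measure, the sketch below computes Kendall's $W$ for one prompt from repeated rankings, using the standard tie-free formula $W = 12S / (m^2 (n^3 - n))$, where $m$ is the number of repeated rankings, $n$ is the number of responses, and $S$ is the sum of squared deviations of the per-response rank sums from their mean. The function names, the input layout, and the "keep the top fraction of prompts by $W$" selection rule are assumptions made for illustration; they are not taken from the paper's code.

```python
from typing import Dict, List, Sequence


def kendalls_w(rankings: Sequence[Sequence[int]]) -> float:
    """Kendall's W for m complete, tie-free rankings of n items.

    rankings[j][i] is the rank (1..n) that evaluation j gave to response i.
    Returns a value in [0, 1]; 1 means all evaluations agree exactly.
    """
    m = len(rankings)
    n = len(rankings[0])
    # Sum of ranks each response received across the m evaluations.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def top_fraction_by_w(
    prompt_rankings: Dict[str, Sequence[Sequence[int]]], fraction: float
) -> List[str]:
    """Hypothetical selection rule: keep the prompts with the highest W."""
    ordered = sorted(
        prompt_rankings,
        key=lambda p: kendalls_w(prompt_rankings[p]),
        reverse=True,
    )
    return ordered[: int(len(ordered) * fraction)]
```

With five identical rankings of seven responses, `kendalls_w` returns 1; two exactly opposite rankings give 0. A real pipeline would also need to handle ties and malformed evaluator output, which this sketch ignores.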

Paper Structure (10 sections, 3 figures, 5 tables)

This paper contains 10 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: A visual description of how we select our data for training. We use our Repeat Ranking method to repeat the evaluations of the models' responses multiple times and then train only on the best and worst responses whose rankings have a high Kendall's W, a measure of ranking agreement.
  • Figure 2: Plots of how often each model's response was chosen as the positive/negative response for training using the Borda count. We observe that a plurality but not a majority of our positive training data comes from GPT-4, while the vast majority of our negative training data comes from responses by Starling and GPT-3.5-Turbo.
  • Figure 3: System message for generating evaluations
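The Borda count mentioned in the Figure 2 caption can be sketched as follows: each evaluation awards a response more points the better it is ranked, points are summed across the repeated evaluations, and the highest- and lowest-scoring responses become the positive and negative training examples. The point scheme (`n - rank`), the tie-breaking by lowest index, and the function names are assumptions for illustration, not details confirmed by the source.

```python
from typing import List, Sequence, Tuple


def borda_scores(rankings: Sequence[Sequence[int]]) -> List[int]:
    """Sum Borda points per response across repeated rankings.

    rankings[j][i] is the rank (1 = best) that evaluation j gave response i.
    A response ranked r out of n earns n - r points in that evaluation.
    """
    n = len(rankings[0])
    return [sum(n - r[i] for r in rankings) for i in range(n)]


def pick_pair(rankings: Sequence[Sequence[int]]) -> Tuple[int, int]:
    """Return (chosen_index, rejected_index) by highest and lowest Borda score.

    Ties are broken by lowest index, which is an arbitrary assumption.
    """
    scores = borda_scores(rankings)
    indices = range(len(scores))
    chosen = max(indices, key=scores.__getitem__)
    rejected = min(indices, key=scores.__getitem__)
    return chosen, rejected
```

For example, given the three rankings `[[1, 2, 3], [1, 3, 2], [2, 1, 3]]`, the scores are `[5, 3, 1]`, so response 0 would be used as the positive example and response 2 as the negative one.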