Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena

Haipeng Luo; Qingfeng Sun; Can Xu; Pu Zhao; Qingwei Lin; Jianguang Lou; Shifeng Chen; Yansong Tang; Weizhu Chen

TL;DR

Arena Learning introduces an offline, AI-driven arena framework that simulates LMSYS-style chatbot battles to create a scalable data flywheel for post-training LLM improvements. It leverages a judge LLM to generate battle judgments and WizardArena to align offline Elo rankings with online benchmarks, enabling fully automated SFT, DPO, and PPO updates to WizardLM-β. Empirical results show high offline-online alignment (about 98.8% consistency with LMSYS) and significant performance gains across MT-Bench, AlpacaEval LC, and OpenLLM Leaderboard over multiple data-flywheel iterations. The approach demonstrates a cost-effective, scalable path to continuously advance LLM capabilities through synthetic arena data and iterative training, with strong implications for WizardLM-2 derivatives.
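
The offline Elo rankings mentioned above can be illustrated with the standard Elo update over pairwise battle outcomes. This is a minimal sketch, not WizardArena's actual procedure: the K-factor, the initial rating, and the names `expected_score` and `update_elo` are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    battles: Iterable[Tuple[str, str, float]],
    k: float = 4.0,          # assumed K-factor, not taken from the paper
    init: float = 1000.0,    # assumed starting rating for unseen models
) -> Dict[str, float]:
    """Apply sequential Elo updates.

    Each battle is (model_a, model_b, outcome), where outcome is
    1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    """
    ratings = dict(ratings)  # do not mutate the caller's dictionary
    for model_a, model_b, outcome in battles:
        r_a = ratings.setdefault(model_a, init)
        r_b = ratings.setdefault(model_b, init)
        e_a = expected_score(r_a, r_b)
        ratings[model_a] = r_a + k * (outcome - e_a)
        ratings[model_b] = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return ratings
```

In a simulated arena, the `outcome` values would come from a judge LLM rather than human voters; the update rule itself is unchanged.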

Abstract

Assessing the effectiveness of large language models (LLMs) presents substantial challenges. The method of conducting human-annotated battles in an online Chatbot Arena is a highly effective evaluative technique. However, this approach is limited by the costs and time required for human annotation. In this paper, we introduce Arena Learning, an innovative offline strategy designed to simulate these arena battles using AI-driven annotations to evaluate battle outcomes, thus facilitating the continuous improvement of the target model through both supervised fine-tuning and reinforcement learning. Arena Learning comprises two key elements. First, it ensures precise evaluations and maintains consistency between offline simulations and online competitions via WizardArena, a pipeline developed to accurately predict the Elo rankings of various models using a meticulously designed offline test set. Our results demonstrate that WizardArena's predictions closely align with those from the online Arena. Second, it involves the continuous improvement of training data based on the battle results and the refined model. We establish a data flywheel to iteratively update the training data by highlighting the weaknesses of the target model based on its battle results, enabling it to learn from the strengths of multiple different models. We apply Arena Learning to train our target model, WizardLM-β, and demonstrate significant performance enhancements across various metrics. This fully automated training and evaluation pipeline sets the stage for continuous advancements in various LLMs via post-training. Notably, Arena Learning plays a pivotal role in the success of WizardLM-2, and this paper serves both as an exploration of its efficacy and a foundational study for future discussions related to WizardLM-2 and its derivatives.
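
One round of the data flywheel described in the abstract might be sketched as follows: the target model battles several other models on each prompt, a judge picks the winner, and the prompts where the target loses become new training examples built from the winning response. This is a hedged illustration only. The function name `collect_losses`, the record fields, the `"A"`/`"B"`/`"tie"` verdict convention, and the rule of keeping the first winning opponent are all assumptions, and the judge here is an arbitrary callable standing in for a judge LLM.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]            # prompt -> response
Judge = Callable[[str, str, str], str]  # (prompt, answer_a, answer_b) -> "A", "B", or "tie"


def collect_losses(
    prompts: List[str],
    target: Model,
    opponents: Dict[str, Model],
    judge: Judge,
) -> List[Dict[str, str]]:
    """Return one training record per prompt on which the target model loses.

    The target's answer is always passed to the judge as answer A and the
    opponent's answer as answer B, so a verdict of "B" means the target lost.
    """
    records: List[Dict[str, str]] = []
    for prompt in prompts:
        target_answer = target(prompt)
        for name, opponent in opponents.items():
            opponent_answer = opponent(prompt)
            if judge(prompt, target_answer, opponent_answer) == "B":
                records.append(
                    {
                        "prompt": prompt,
                        "chosen": opponent_answer,   # usable as an SFT target
                        "rejected": target_answer,   # usable as the losing side of a preference pair
                        "source": name,
                    }
                )
                break  # assumed rule: keep only the first winning opponent
    return records
```

Under these assumptions, the `chosen` field alone would feed supervised fine-tuning, while `chosen` and `rejected` together form a preference pair for reinforcement-learning-style training. Repeating the round with the retrained target gives the iterative loop.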

Paper Structure (20 sections, 3 equations, 12 figures, 11 tables)

Introduction
Approach
ChatBot Arena and Elo Ranking
Using a Powerful LLM as Judge to Simulate Human Annotators
Build a Data Flywheel to Post-train LLMs
Collect Large-Scale Instruction Data
Iterative Battle and Model Evolving
Evaluate LLMs with WizardArena
Experiments
Experimental Setup
Offline WizardArena closely aligns with the Online LMSYS ChatBot Arena
Can Arena Learning build an effective data flywheel with post-training?
Scaling Iterative SFT, DPO, and PPO with Arena Learning
Ablation Study
Related Works
...and 5 more sections

Figures (12)

Figure 1: OpenRouter LLM Rankings on processed tokens (https://openrouter.ai/rankings).
Figure 2: Overview of Arena Learning post-training data flywheel and WizardArena evaluation.
Figure 3: Overview of the running example: how we use a simulated AI-powered pairwise battle arena to produce post-training data and evaluate models.
Figure 4: WizardArena-Mix Turn statistics
Figure 5: WizardArena-Mix Category statistics
...and 7 more figures
