GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

Shen Zheng; Yuyu Zhang; Yijie Zhu; Chenguang Xi; Pengyang Gao; Xun Zhou; Kevin Chen-Chuan Chang

TL;DR

GPT-Fathom presents an open-source, reproducible evaluation suite for large language models, built atop OpenAI Evals, to enable apples-to-apples comparisons across 10+ models and 20+ benchmarks in 7 capability areas. It provides a retrospective analysis of OpenAI's GPT-3 to GPT-4 evolution, examining how code data, SFT, and RLHF shape capabilities and the alignment tax. The study highlights a seesaw phenomenon in capabilities, significant prompt sensitivity, and the differential impact of training data and alignment techniques, offering guidance for more transparent benchmarking. Overall, GPT-Fathom serves as a standard gauge for positioning new LLMs and diagnosing gaps to bridge toward GPT-4 and beyond.

Abstract

With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. The community is eager to know how GPT-3 progressively improved into GPT-4, including technical details such as whether adding code data improves an LLM's reasoning capability, which aspects of LLM capability SFT and RLHF can improve, and how large the alignment tax is. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.

Paper Structure (30 sections, 3 figures, 16 tables)

Introduction
Related Work
Method
Benchmarks for Evaluation
Details of Black-box Evaluation
Experiments
Overall Performance
Analysis and Insights
Conclusions
Details of Evaluated LLMs
Details of Benchmark Datasets
Details of Evaluation
Sampling Hyperparameters
Evaluation Prompts
Answer Parsing and Metric Computation
...and 15 more sections

Figures (3)

Figure 1: OpenAI's evolutionary path from GPT-3 to GPT-4. We omit deprecated legacy models such as code-davinci-001 and only list the models evaluated in GPT-Fathom.
Figure 2: Radar charts visualizing the capabilities of the evaluated LLMs. We exclude PaLM 2-L and Claude 2 because their reported performance is missing on some benchmarks.
Figure 3: Radar charts visualizing the capabilities of the LLaMA and Llama 2 family models.
