A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models

Junjie Ye, Xuanting Chen, Nuo Xu, Can Zu, Zekai Shao, Shichun Liu, Yuhan Cui, Zeyang Zhou, Chao Gong, Yang Shen, Jie Zhou, Siming Chen, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

Extensive experiments reveal that the overall ability of GPT series models on NLU tasks does not increase gradually as the models evolve, especially with the introduction of the RLHF training strategy, and indicate that there is still room for improvement in areas such as model robustness.

Abstract

GPT series models, such as GPT-3, CodeX, InstructGPT, and ChatGPT, have gained considerable attention due to their exceptional natural language processing capabilities. However, despite the abundance of research on the difference in capabilities between GPT series models and fine-tuned models, little attention has been given to how the capabilities of GPT series models have evolved over time. To conduct a comprehensive analysis of these capabilities, we select six representative models, comprising two GPT-3 series models (i.e., davinci and text-davinci-001) and four GPT-3.5 series models (i.e., code-davinci-002, text-davinci-002, text-davinci-003, and gpt-3.5-turbo). We evaluate their performance on nine natural language understanding (NLU) tasks using 21 datasets. In particular, we compare the performance and robustness of different models for each task under zero-shot and few-shot scenarios. Our extensive experiments reveal that the overall ability of GPT series models on NLU tasks does not increase gradually as the models evolve, especially with the introduction of the RLHF training strategy. While this strategy enhances the models' ability to generate human-like responses, it also compromises their ability to solve some tasks. Furthermore, our findings indicate that there is still room for improvement in areas such as model robustness.

Paper Structure

This paper contains 28 sections, 5 figures, and 46 tables.

Figures (5)

  • Figure 1: The evolutionary relationship of the GPT series models. FeedME and PPO are two distinct training strategies officially described by OpenAI. A dashed arrow ($\dashrightarrow$) is used between GPT-3 and GPT-3.5 because the official documentation does not specify how the training of the two series differs.
  • Figure 2: The performance of different models in the zero-shot scenario. A missing bar for a dataset means that the model cannot perform the specified task on that dataset. See Appendix A.1 for specific data.
  • Figure 3: The analyzability rates of davinci on different datasets in both zero-shot and three-shot scenarios, with the results ordered by the ratio of three-shot to zero-shot performance. Detailed results are listed in Appendix A.2.
  • Figure 4: The performance of davinci on different datasets in both zero-shot and three-shot scenarios, with the results ordered by the ratio of three-shot to zero-shot performance. Detailed results are listed in Appendix A.2.
  • Figure 5: The analyzability rates of davinci's answers in the zero-shot scenario. The results are ordered by the ratio of the analyzability rate when the prompt ends with the word "Answer" to the rate when it does not. Detailed results are listed in Appendix A.3.
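
The captions for Figures 3 to 5 order their results by a ratio of two scores per dataset (three-shot to zero-shot, or with "Answer" to without). The following is a minimal sketch of such an ordering, not the authors' code. The function name `order_by_ratio`, the dataset names, and all scores are hypothetical placeholders, and the ascending sort direction is an assumption, since the captions do not state it.

```python
from typing import Dict, List, Tuple


def order_by_ratio(
    baseline: Dict[str, float], variant: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (dataset, variant / baseline) pairs sorted by ascending ratio.

    Datasets missing from either mapping are skipped. A baseline of zero
    is mapped to infinity so the division never fails (an assumption made
    for this sketch, not something the source specifies).
    """
    ratios = []
    for name, base_score in baseline.items():
        if name not in variant:
            continue
        if base_score == 0:
            ratio = float("inf")
        else:
            ratio = variant[name] / base_score
        ratios.append((name, ratio))
    return sorted(ratios, key=lambda pair: pair[1])


# Placeholder scores for illustration only; NOT taken from the paper.
zero_shot = {"dataset_a": 40.0, "dataset_b": 80.0, "dataset_c": 0.0}
three_shot = {"dataset_a": 60.0, "dataset_b": 60.0, "dataset_c": 10.0}

ordered = order_by_ratio(zero_shot, three_shot)
for name, ratio in ordered:
    print(f"{name}: {ratio:.2f}")
```

With the placeholder scores above, datasets where three-shot prompting hurts (ratio below 1) come first and those where it helps come last. The same function applies to the Figure 5 ordering if `baseline` holds the rates without "Answer" and `variant` the rates with it.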