GPTEval: A Survey on Assessments of ChatGPT and GPT-4

Rui Mao, Guanyi Chen, Xulang Zhang, Frank Guerin, Erik Cambria

TL;DR

This survey synthesizes assessments of ChatGPT and GPT-4 across language, reasoning, scientific knowledge, and ethics, highlighting strong language abilities alongside gaps in domain-specific knowledge and challenges in evaluation methodology. It critiques the reliance on prompts and benchmarks and the risk of data leakage, and it documents heterogeneous results across tasks, languages, and domains. As key directions, the authors propose task-agnostic evaluation, continued foundational NLP research, and regulatory considerations for AI-generated content. The work emphasizes how performance varies by domain and prompt, and it outlines practical implications for benchmarking and governance of future LLMs.

Abstract

The emergence of ChatGPT has generated much speculation in the press about its potential to disrupt social and economic systems. Its astonishing language ability has aroused strong curiosity among scholars about its performance in different domains. There have been many studies evaluating the ability of ChatGPT and GPT-4 in different tasks and disciplines. However, a comprehensive review summarizing the collective assessment findings is lacking. The objective of this survey is to thoroughly analyze prior assessments of ChatGPT and GPT-4, focusing on their language and reasoning abilities, scientific knowledge, and ethical considerations. Furthermore, the survey examines existing evaluation methods and offers several recommendations for future research in evaluating large language models.

Paper Structure (17 sections, 1 figure, 9 tables)

Figures (1)

  • Figure 1: The concept mapping patterns of humans (left) and ChatGPT (right), from mao2024comparative. Each cluster on the left represents target concepts, while each cluster on the right represents source concepts. Bright and grey dots denote activated and unactivated concepts, respectively. The capitalized terms represent key activated concepts within a cluster.