A Survey on the Real Power of ChatGPT

Ming Liu, Ran Liu, Ye Zhu, Hua Wang, Youyang Qu, Rongsheng Li, Yongpan Sheng, Wray Buntine

TL;DR

This survey analyzes how ChatGPT performs across seven NLP task categories and examines social implications and safety issues. It synthesizes findings that zero-shot and few-shot capabilities are strong but generally lag fine-tuned models, and that generalization to new data is limited and time-varying. It also highlights methodological challenges, notably heavy reliance on prompt engineering and potential data contamination in leaderboards, and outlines key opportunities in explainability, continual learning, and lightweight modeling. The work provides a concise, evidence-based perspective to guide robust evaluation and responsible deployment of ChatGPT-like systems.

Abstract

ChatGPT has changed the AI community, and evaluating its performance is an active line of research. A key challenge for evaluation is that ChatGPT remains closed-source and traditional benchmark datasets may have been used as its training data. In this paper, we (i) survey recent studies that uncover the real performance levels of ChatGPT in seven categories of NLP tasks, (ii) review the social implications and safety issues of ChatGPT, and (iii) emphasize key challenges and opportunities for its evaluation. We hope our survey sheds some light on its black-box nature, so that researchers are not misled by its surface generation.

Paper Structure (29 sections, 2 tables)