A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Xuanfan Ni, Piji Li

TL;DR

This work presents a comprehensive, multilingual evaluation of large language models on natural language generation tasks, focusing on dialogue generation and text summarization in English and Chinese. By standardizing input templates, decoding hyperparameters, and post-processing, the study provides a fair comparison across ChatGPT, ChatGLM, T5-based, LLaMA-based, and Pythia-based families. Key findings show encoder-decoder variants like Flan-T5-XXL and FastChat-T5 excel at instruction following, Alpaca-Lora and Vicuna offer diverse outputs, and ChatGPT often achieves the strongest overall performance with scale. The results underscore the importance of instruction tuning and parameter-efficient fine-tuning (e.g., LoRA, P-Tuning v2) for improving NLG capabilities while highlighting remaining challenges in long-form generation and cross-lingual transfer.

Abstract

Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports automatic results, accompanied by a detailed analysis.

Paper Structure (33 sections, 1 equation, 1 figure, 11 tables)


Figures (1)

  • Figure 1: Input templates for English (left) and Chinese (right) datasets. The instruction and text placeholders are replaced with the content corresponding to each dataset.
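
The common evaluation setting described in the abstract (an input template with instruction and text slots, followed by post-processing) can be illustrated with a minimal sketch. The template wording, the function names build_prompt and post_process, and the specific clean-up rule (drop an echoed prompt, keep the first non-empty line) are all assumptions made for illustration; the paper's actual templates appear only in Figure 1 and its post-processing rules are not given here.

```python
# Hypothetical template: the real wording is shown only in the paper's Figure 1.
TEMPLATE = "Instruction: {instruction}\nInput: {text}\nOutput:"


def build_prompt(instruction: str, text: str) -> str:
    """Fill the instruction and text slots of the (assumed) template."""
    return TEMPLATE.format(instruction=instruction, text=text)


def post_process(prompt: str, generation: str) -> str:
    """Assumed clean-up rule: remove an echoed prompt, keep the first non-empty line."""
    if generation.startswith(prompt):
        generation = generation[len(prompt):]
    for line in generation.splitlines():
        if line.strip():
            return line.strip()
    return ""


if __name__ == "__main__":
    prompt = build_prompt("Summarize the text.", "The cat sat on the mat.")
    raw = prompt + " A cat sat.\nExtra text the model kept generating."
    print(post_process(prompt, raw))
```

Applying the same template and the same clean-up step to every model is what makes scores comparable across model families; the particular rule used above is only one plausible choice.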