ChatGPT vs Human-authored Text: Insights into Controllable Text Summarization and Sentence Style Transfer
Dongqi Liu, Vera Demberg
TL;DR
This work systematically examines ChatGPT’s performance on two controllable text-generation tasks, audience-specific summarization and sentence formality transfer, and compares its outputs to human-authored text. Using zero-shot prompts on the ELIFE and GYAFC datasets, the study analyzes readability, content fidelity, and stylistic differences with metrics such as ROUGE, SummaC, and BLEU, with POS and dependency distributions, and with hallucination checks. The key findings are that humans exhibit larger stylistic variation than ChatGPT, and that ChatGPT’s outputs can deviate from the source semantics and contain hallucinations, although prompt engineering and example-guided prompts can improve alignment somewhat. The work highlights practical implications for deploying LLMs in controllable writing tasks: reliability remains a concern, and guided prompts help narrow the gap with human performance.
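To make the overlap-based metrics mentioned above more concrete, here is a minimal sketch of a ROUGE-1-style F1 score, that is, unigram overlap between a generated text and a reference. It is an illustration only, not the evaluation code used in the paper: the function name `rouge1_f1` and the whitespace tokenization are assumptions, and standard ROUGE implementations add proper tokenization and optional stemming.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference text.

    Hypothetical helper for illustration: lowercases and splits on
    whitespace, with no stemming or punctuation handling.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "the cat"
    human = "the cat sat on the mat"
    print(f"ROUGE-1 F1: {rouge1_f1(generated, human):.2f}")
```

A score like this only captures surface overlap with a reference, which may be why the study pairs it with consistency-oriented checks such as SummaC and with explicit hallucination analysis.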
Abstract
Large-scale language models, like ChatGPT, have garnered significant media attention and stunned the public with their remarkable capacity for generating coherent text from short natural language prompts. In this paper, we aim to conduct a systematic inspection of ChatGPT's performance in two controllable generation tasks, with respect to ChatGPT's ability to adapt its output to different target audiences (expert vs. layman) and writing styles (formal vs. informal). Additionally, we evaluate the faithfulness of the generated text, and compare the model's performance with human-authored texts. Our findings indicate that the stylistic variations produced by humans are considerably larger than those demonstrated by ChatGPT, and the generated texts diverge from human samples in several characteristics, such as the distribution of word types. Moreover, we observe that ChatGPT sometimes incorporates factual errors or hallucinations when adapting the text to suit a specific style.
