ChatGPT vs Human-authored Text: Insights into Controllable Text Summarization and Sentence Style Transfer

Dongqi Liu, Vera Demberg

TL;DR

This work systematically examines ChatGPT’s ability in two controllable text-generation tasks: audience-specific summarization and sentence formality transfer, comparing outputs to human-authored text. Using zero-shot prompts on the ELIFE and GYAFC datasets, the study analyzes readability, content fidelity, and stylistic differences via metrics like ROUGE, SummaC, BLEU, and POS/dependency distributions, plus hallucination checks. Key findings show humans exhibit larger stylistic variation than ChatGPT, and ChatGPT’s outputs can deviate from source semantics and exhibit hallucinations, though prompt engineering and example-guided prompts can improve alignment somewhat. The work highlights practical implications for deploying LLMs in controllable writing tasks, underscoring reliability concerns and the value of guided prompts to offset gaps with human performance.

Abstract

Large-scale language models, like ChatGPT, have garnered significant media attention and stunned the public with their remarkable capacity for generating coherent text from short natural language prompts. In this paper, we aim to conduct a systematic inspection of ChatGPT's performance in two controllable generation tasks, with respect to ChatGPT's ability to adapt its output to different target audiences (expert vs. layman) and writing styles (formal vs. informal). Additionally, we evaluate the faithfulness of the generated text, and compare the model's performance with human-authored texts. Our findings indicate that the stylistic variations produced by humans are considerably larger than those demonstrated by ChatGPT, and the generated texts diverge from human samples in several characteristics, such as the distribution of word types. Moreover, we observe that ChatGPT sometimes incorporates factual errors or hallucinations when adapting the text to suit a specific style.

Paper Structure (38 sections, 15 figures, 9 tables)

Figures (15)

  • Figure 1: Comparison of abstractiveness between ChatGPT and human-generated summaries
  • Figure 2: Summary consistency detection. L stands for layman, E for expert.
  • Figure 3: Absolute differences in POS tags distribution of ChatGPT and human-generated sentences: GYAFC - EM
  • Figure 4: Dependency arc entailment: GYAFC - EM. Data points $> 0.95$ $\approx$ Accurate. To clarify discrepancies, cutoff point $= 0.95$.
  • Figure 5: Absolute differences in dependency labels distribution of ChatGPT and human-generated formal style sentences: GYAFC - EM
  • ...and 10 more figures