Can GPT Models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals
Yongxin Zhou, Fabien Ringeval, François Portet
TL;DR
This study systematically evaluates whether GPT models (ChatGPT, GPT-4, GPT-4o) can generate dialogue summaries that adhere to human summarization guidelines, using DialogSum (English) and DECODA (French) as target datasets. It compares several prompts (WordLimit, HG, HGR, and the two-step HG(R)→WL) and evaluates outputs with ROUGE, BERTScore, LLM-based judgments, and human assessments across Faithfulness, Main Issues, Sub-Issues, and Resolution. The findings show that GPT models often outperform task-specific baselines on human-guideline alignment yet lag on automatic metrics, with length and entity-preservation issues in some cases; the two-step prompting approach (HG(R)→WL) partially mitigates these gaps. The results underscore the need for evaluation metrics that align with concrete guidelines and targeted communication goals, and they suggest directions for improving both evaluation methods and prompting strategies in goal-directed dialogue summarization.
Abstract
This study investigates the ability of GPT models (ChatGPT, GPT-4, and GPT-4o) to generate dialogue summaries that adhere to human guidelines. We experiment with various prompts designed to guide the models toward complying with the guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on the summarization guidelines, serves as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over both those from task-specific pre-trained models and the reference summaries, highlighting GPT models' ability to follow human guidelines despite occasionally producing longer outputs and diverging lexically and structurally from the references. The discrepancy between automatic metrics (ROUGE, BERTScore) and human evaluation underscores the need for more reliable automatic evaluation metrics.
