Long Dialog Summarization: An Analysis
Ankan Mullick, Ayan Kumar Bhowmick, Raghav R, Ravi Kokku, Prasenjit Dey, Pawan Goyal, Niloy Ganguly
TL;DR
This paper addresses long dialog summarization by proposing rubric-driven fine-tuning of large language models to produce context-driven or objective-driven summaries across domains such as shopping interactions and customer support. It evaluates multiple architectures (Longformer, T5, Flan-T5, BART, and ChatGPT) and input strategies on two benchmarks, QMSum and SummScreen, and finds that no single model universally dominates and that domain- and task-specific rubric guidance is crucial. Comparisons against ground-truth summaries using standard metrics (BLEU, ROUGE, METEOR, BERTScore) plus an IEC coverage measure show that model strengths vary with dataset and prompting strategy, with Longformer often leading on content overlap and ChatGPT benefiting from explicit prompts. The work emphasizes the practical need for context- and goal-oriented summarization, suggests ensemble or rubric-based approaches, and outlines future work to extend to multilingual and multimodal settings while addressing ethical considerations.
Abstract
Dialog summarization has become increasingly important in managing and comprehending large-scale conversations across various domains. The task presents unique challenges in capturing the key points, context, and nuances of long multi-turn conversations. Summarization techniques may vary with specific requirements: in a shopping-chatbot scenario, the dialog summary helps to learn user preferences, whereas in a customer call center, the summary may cover the problem attributes that a user specified and the final resolution provided. This work emphasizes the significance of creating coherent and contextually rich summaries for effective communication in various applications. We explore current state-of-the-art approaches for long dialog summarization in different domains, and evaluations based on benchmark metrics show that no single model performs well across domains and summarization tasks.
