Uncovering the Potential of ChatGPT for Discourse Analysis in Dialogue: An Empirical Study

Yaxin Fan, Feng Jiang, Peifeng Li, Haizhou Li

TL;DR

This study evaluates ChatGPT on two dialogue discourse tasks, topic segmentation and discourse parsing, using a prompt framework that separates task description, output format, and structured input. Across four topic-segmentation datasets and two discourse-parsing datasets, ChatGPT identifies topic boundaries well in general-domain conversations but struggles with domain-specific topics and with hierarchical discourse relations, often producing linear links between adjacent utterances rather than long-range structures. In-context learning with chain-of-thought can significantly improve hierarchical parsing, and the ablation shows that output format is the most influential prompt component. The work highlights both the potential and the current limits of LLMs for deep dialogue understanding and offers practical guidance for prompt design; code is available at the authors’ GitHub repository.

Abstract

Large language models, like ChatGPT, have shown remarkable capability in many downstream tasks, yet their ability to understand the discourse structures of dialogues, which requires higher-level understanding and reasoning, remains less explored. In this paper, we aim to systematically inspect ChatGPT's performance in two discourse analysis tasks: topic segmentation and discourse parsing, focusing on its deep semantic understanding of the linear and hierarchical discourse structures underlying dialogue. To instruct ChatGPT to complete these tasks, we first craft a prompt template consisting of the task description, output format, and structured input. Then, we conduct experiments on four popular topic segmentation datasets and two discourse parsing datasets. The experimental results show that ChatGPT is proficient at identifying topic structures in general-domain conversations yet struggles considerably in specific-domain conversations. We also find that ChatGPT hardly understands rhetorical structures that are more complex than topic structures. Our deeper investigation indicates that ChatGPT can give more reasonable topic structures than human annotations but parses the hierarchical rhetorical structures only linearly. In addition, we examine the impact of in-context learning (e.g., chain-of-thought) on ChatGPT and conduct an ablation study on the various prompt components, which can provide a research foundation for future work. The code is available at https://github.com/yxfanSuda/GPTforDDA.
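The three-part prompt template named in the abstract (task description, output format, structured input) can be illustrated with a minimal sketch. This is not the authors' actual prompt: the function name `build_prompt`, the `U1 Speaker: text` line layout, and the wording of the example strings are all assumptions made for illustration; the real templates are in the authors' repository.

```python
from typing import List, Tuple


def build_prompt(
    task_description: str,
    output_format: str,
    utterances: List[Tuple[str, str]],
) -> str:
    """Join the three prompt components into one prompt string.

    `utterances` is a list of (speaker, text) pairs. The numbered
    "U1 Speaker: text" layout is a hypothetical choice, not the
    layout used in the paper.
    """
    # Structured input: one numbered utterance per line.
    structured_input = "\n".join(
        f"U{i} {speaker}: {text}"
        for i, (speaker, text) in enumerate(utterances, start=1)
    )
    # Components are separated by blank lines so each stays distinct.
    return "\n\n".join([task_description, output_format, structured_input])


if __name__ == "__main__":
    # Hypothetical wording for the first two components.
    prompt = build_prompt(
        "Segment the following dialogue into topics.",
        "Return one 0/1 label per utterance; 1 marks the end of a topic.",
        [("Cat", "anyone have wood?"), ("wil", "sorry, none")],
    )
    print(prompt)
```

Keeping the components as separate arguments makes it straightforward to drop or vary one of them, which is the kind of comparison an ablation over prompt components needs.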

Paper Structure (37 sections, 6 figures, 10 tables)

Figures (6)

  • Figure 1: A dialogue from the STAC dataset (Asher et al., 2016), consisting of seven utterances $U_1$-$U_7$ and three speakers: Cat, wil, and Thomas. Dialogue topic segmentation aims to reveal the linear topic structure by dividing the dialogue into several topical pieces, where '1' indicates the end of a topic. Dialogue discourse parsing aims to reflect the hierarchical rhetorical structure by establishing discourse links between utterance pairs according to discourse relations, where Cont, QAP, and Exp are short for Continuation, Question-answer_pair, and Explanation, respectively.
  • Figure 2: Post-processing for dialogue topic segmentation.
  • Figure 3: Manual pair-wise evaluation between ChatGPT-generated and human-annotated topic structures. "win" indicates that the ChatGPT-generated topic structure is more reasonable, "tie" indicates that the two structures are equally reasonable, and "lose" indicates that the human-annotated topic structure is more reasonable.
  • Figure 4: Performance comparison between ChatGPT and baselines on STAC and Molweni at various link distances. If there is a link between $U_j$ and $U_i$, the distance of the link is defined as $i-j$.
  • Figure 5: (a) and (b) show the details of in-context learning with one exemplar for dialogue topic segmentation and dialogue discourse parsing, respectively.
  • ...and 1 more figure
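Two conventions from the captions above can be made concrete with a short sketch: the per-utterance boundary labels of Figure 1 ('1' marks the end of a topic) and the link distance $i-j$ of Figure 4. The function names, the segment lengths, and the example links are hypothetical and are not taken from the paper's data. Labeling the final utterance of the dialogue as 1 is also an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def boundary_labels(segment_lengths: List[int]) -> List[int]:
    """Return one label per utterance; 1 marks the last utterance of a topic."""
    labels: List[int] = []
    for length in segment_lengths:
        if length < 1:
            raise ValueError("each topic segment needs at least one utterance")
        # All utterances but the last get 0; the last one closes the topic.
        labels.extend([0] * (length - 1) + [1])
    return labels


def link_distance(link: Tuple[int, int]) -> int:
    """Distance i - j for a link between U_j and U_i, given as (j, i)."""
    j, i = link
    return i - j


def distance_histogram(links: Iterable[Tuple[int, int]]) -> Counter:
    """Count how many links fall at each distance."""
    return Counter(link_distance(link) for link in links)


if __name__ == "__main__":
    # Hypothetical split of a seven-utterance dialogue into two topics.
    print(boundary_labels([3, 4]))  # [0, 0, 1, 0, 0, 0, 1]
    # Hypothetical links (j, i); adjacent links have distance 1.
    print(distance_histogram([(1, 2), (1, 3), (3, 4)]))
```

Grouping links by distance in this way is what allows a per-distance comparison: a parser that only links adjacent utterances would put all of its predictions in the distance-1 bucket.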