Prompting Implicit Discourse Relation Annotation

Frances Yung, Mansoor Ahmad, Merel Scholman, Vera Demberg

TL;DR

This work investigates whether prompting strategies can enable zero-/few-shot annotation of implicit discourse relations (DRs) with GPT-4. It focuses on breaking a 14-way DR classification into smaller tasks via two-step discourse connective (DC) insertion, per-class binary prompts, and per-class verification prompts. Across the PDTB 3.0 and DiscoGeM datasets, GPT-4's implicit DR predictions show limited gains and remain substantially behind the supervised state of the art, even with sophisticated prompt designs. The results indicate that implicit DR recognition may not be solvable under zero-/few-shot settings without explicit supervision or additional signals, though some prompts offer multi-label annotation potential and insights into prompt-dependent behavior. The study highlights the gap between prompt-driven reasoning and the need for task-specific supervision in fine-grained linguistic classification, with implications for crowd-sourced annotation pipelines and future prompt engineering efforts.

Abstract

Pre-trained large language models, such as ChatGPT, achieve outstanding performance in various reasoning tasks without supervised training and have been found to outperform crowdsourcing workers. Nonetheless, ChatGPT's performance in the task of implicit discourse relation classification, prompted by a standard multiple-choice question, is still far from satisfactory and considerably inferior to state-of-the-art supervised approaches. This work investigates several proven prompting techniques to improve ChatGPT's recognition of discourse relations. In particular, we experimented with breaking down the classification task, which involves numerous abstract labels, into smaller subtasks. However, experimental results show that the inference accuracy hardly changes even with sophisticated prompt engineering, suggesting that implicit discourse relation classification is not yet resolvable under zero-shot or few-shot settings.
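
The idea of splitting one many-label question into per-class yes/no subtasks can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-label subset, the prompt wording, the helper name per_class_candidates, and the stub_model stand-in for a real model call are hypothetical and are not taken from the paper.

```python
from typing import Callable, List

# Hypothetical subset of relation labels; the paper's task uses 14 classes.
LABELS = ["cause", "contrast", "conjunction"]


def per_class_candidates(
    arg1: str,
    arg2: str,
    ask: Callable[[str], str],
    labels: List[str] = LABELS,
) -> List[str]:
    """Ask one yes/no question per label and keep the labels answered 'yes'."""
    kept = []
    for label in labels:
        # Illustrative prompt wording only, not the prompt used in the paper.
        prompt = (
            f"Argument 1: {arg1}\n"
            f"Argument 2: {arg2}\n"
            f"Does a '{label}' relation hold between the arguments? "
            "Answer yes or no."
        )
        if ask(prompt).strip().lower().startswith("yes"):
            kept.append(label)
    return kept


def stub_model(prompt: str) -> str:
    """Stand-in for a language model call: accepts only the 'cause' label."""
    return "yes" if "'cause'" in prompt else "no"


result = per_class_candidates(
    "It rained all day.", "The match was cancelled.", stub_model
)
print(result)
```

Because each label is judged independently, the function can return zero, one, or several labels, which is why a per-class design lends itself to multi-label annotation and usually needs a separate aggregation step when a single label is required.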

Paper Structure (23 sections, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Confusion matrices comparing the gold and predicted labels in the PDTB 3.0 test set using the multiple-choice (MC) prompt. The left matrix is normalized by predicted class, so its diagonal corresponds to precision; the right matrix is normalized by gold class, so its diagonal corresponds to recall. The percentages in brackets are the overall distributions of the predicted and gold labels, respectively.
  • Figure 2: Confusion matrices comparing the gold and predicted labels in the PDTB test set using the per-class verification prompt with the MC aggregation step.
  • Figure 3: Confusion matrices comparing the single gold and predicted labels in the DiscoGeM test set using the MC prompt.
  • Figure 4: Confusion matrices comparing the single gold and predicted labels in the DiscoGeM test set using the per-class verification prompt with the MC aggregation step.
  • Figure 5: MC prompt adapted from Chan et al. (2023)
  • ...and 3 more figures
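
The two normalizations described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' plotting code; the 2x2 count matrix and the function names are made up for the example, and the matrix is assumed to be indexed as counts[gold][predicted].

```python
from typing import List

Matrix = List[List[float]]


def normalize_by_predicted(cm: Matrix) -> Matrix:
    """Scale each column (predicted class) to sum to 1; diagonal = precision."""
    n = len(cm)
    col_sums = [sum(cm[g][p] for g in range(n)) for p in range(n)]
    return [
        [cm[g][p] / col_sums[p] if col_sums[p] else 0.0 for p in range(n)]
        for g in range(n)
    ]


def normalize_by_gold(cm: Matrix) -> Matrix:
    """Scale each row (gold class) to sum to 1; diagonal = recall."""
    result = []
    for row in cm:
        total = sum(row)
        result.append([c / total if total else 0.0 for c in row])
    return result


# Made-up counts: rows are gold classes, columns are predicted classes.
counts = [
    [8, 2],
    [4, 6],
]

by_pred = normalize_by_predicted(counts)
by_gold = normalize_by_gold(counts)

# Precision of class 0 is 8 / (8 + 4); recall of class 0 is 8 / (8 + 2).
print(by_pred[0][0], by_gold[0][0])
```

Reading the same counts both ways separates two failure modes: a class that is predicted too often has low precision in the first matrix, while a class that is rarely recovered has low recall in the second.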