Empirical Analysis of Dialogue Relation Extraction with Large Language Models

Guozheng Li, Zijie Xu, Ziyu Shang, Jiajun Liu, Ke Ji, Yikai Guo

TL;DR

This work tackles the challenging task of Dialogue Relation Extraction (DRE), where relations must be inferred across multi-turn dialogues with sparse, pronoun-rich content. It evaluates generation-based large language models, notably ChatGPT and the open-source Landre, on both full and partial dialogues, demonstrating that scaling and prompting strategies substantially improve DRE performance. Key findings show that indirect extraction with ChatGPT and prompt-tuned Landre achieve results that are competitive with, or better than, traditional sequence- and graph-based methods, with reduced sensitivity to dialogue length. The study introduces Landre, an open-source, LoRA-enabled DRE framework, and provides extensive analyses of error patterns, utterance-length effects, and cross-domain applicability to emotion recognition in conversations, highlighting the practical potential of generative LLMs for DRE research and applications.

Abstract

Dialogue relation extraction (DRE) aims to extract relations between two arguments within a dialogue, which is more challenging than standard RE due to the higher frequency of personal pronouns and the lower information density of dialogues. However, existing DRE methods still suffer from two serious issues: (1) they find it hard to capture long and sparse multi-turn information, and (2) they struggle to extract golden relations from partial dialogues. This motivates us to look for more effective methods that can alleviate these issues. The rise of large language models (LLMs) has sparked considerable interest in evaluating their performance across diverse tasks. To this end, we conduct an initial investigation of the capabilities of different LLMs in DRE, considering both proprietary and open-source models. Interestingly, we discover that LLMs significantly alleviate the two issues in existing DRE methods. Our main findings are: (1) scaling up model size substantially boosts overall DRE performance and achieves exceptional results, tackling the difficulty of capturing long and sparse multi-turn information; (2) LLMs suffer a much smaller performance drop from the entire-dialogue setting to the partial-dialogue setting than existing methods; (3) LLMs deliver competitive or superior performance under both full-shot and few-shot settings compared to the current state of the art; (4) LLMs show modest performance on inverse relations but much stronger improvements on general relations, and they handle dialogues of various lengths well, especially longer ones.

Paper Structure (32 sections, 3 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: An example dialogue with its desired relations. S1-S6: anonymized speaker of each utterance.
  • Figure 2: Formats of the four prompting methods. The outputs of the LLMs are underlined.
  • Figure 3: Paradigm of the Landre framework. In the first step, we construct the prompt tuning data from the original DRE dataset. Then we utilize the parameter-efficient fine-tuning technique LoRA to train the foundation model. The purple, blue and green texts in the above exemplars refer to dialogue context $D$, argument pair $(a_1, a_2)$ and relation set $R$, respectively.
  • Figure 4: Robustness of Landre to increasing utterance length, compared with the baselines TUCORE and HiDialog.
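
The first step described in the Figure 3 caption, turning a DRE example into prompt tuning data, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `DREExample` class, the `build_prompt` and `to_tuning_record` helpers, the prompt wording, and the relation labels are all hypothetical. Only the three ingredients (dialogue context $D$, argument pair $(a_1, a_2)$, relation set $R$) come from the caption. The resulting prompt/completion records would then be passed to a LoRA fine-tuning routine, which is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DREExample:
    """One DRE instance (hypothetical container, not from the paper)."""
    dialogue: List[str]          # dialogue context D, one speaker-prefixed utterance per item
    arguments: Tuple[str, str]   # argument pair (a_1, a_2)
    relation: str                # gold relation label, assumed to be a member of R


def build_prompt(dialogue: List[str], arguments: Tuple[str, str],
                 relation_set: List[str]) -> str:
    """Render D, (a_1, a_2) and R as a single text prompt.

    The template wording is an assumption made for illustration only.
    """
    a1, a2 = arguments
    return (
        "Dialogue:\n" + "\n".join(dialogue) + "\n\n"
        + "Candidate relations: " + ", ".join(relation_set) + "\n"
        + f'Question: what is the relation between "{a1}" and "{a2}"?\n'
        + "Answer:"
    )


def to_tuning_record(example: DREExample, relation_set: List[str]) -> Dict[str, str]:
    """Pair the prompt with the gold relation as the target completion."""
    return {
        "prompt": build_prompt(example.dialogue, example.arguments, relation_set),
        "completion": " " + example.relation,
    }


if __name__ == "__main__":
    # Illustrative relation set R and dialogue; both are made up for this sketch.
    relations = ["per:siblings", "per:friends", "unanswerable"]
    example = DREExample(
        dialogue=["S1: Hey, this is my brother Frank.", "S2: Nice to meet you."],
        arguments=("S1", "Frank"),
        relation="per:siblings",
    )
    record = to_tuning_record(example, relations)
    print(record["prompt"])
    print(record["completion"])
```

Keeping the relation set inside the prompt means the model chooses from an explicit list rather than recalling label names from memory. Whether Landre formats its prompts this way is not stated in the captions above.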