Synthetic Data Augmentation for Cross-domain Implicit Discourse Relation Recognition
Frances Yung, Varsha Suresh, Zaynab Reza, Mansoor Ahmad, Vera Demberg
TL;DR
This paper investigates whether large language model–generated synthetic data can improve cross-domain Implicit Discourse Relation Recognition (IDRR) when labeled target-domain data is unavailable. It constructs synthetic Arg2 continuations conditioned on discourse relation (DR) labels using multiple LLMs and screening methods, then adapts a PDTB-trained RoBERTa model to the DiscoGeM 1.5 target domains via domain-specific, domain-mixed, or PDTB+DG_syn configurations. The key finding is that synthetic data augmentation does not yield consistent improvements over the baseline or pseudo-labeling: results exhibit high variance, and some screening strategies perform worse. These results emphasize the importance of rigorous statistical evaluation and data quality, and the need for annotated cross-domain resources to guide IDRR adaptation in practice.
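To make the label-conditioned generation step concrete, the following is a minimal sketch, not the authors' implementation. The prompt wording, the `RELATION_HINTS` mapping, and the `generate` callable (a stand-in for any LLM call) are all assumptions introduced for illustration; the empty-output check is only a placeholder for the screening methods mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical mapping from relation labels to natural-language instructions.
# The actual prompts used in the paper are not specified in this summary.
RELATION_HINTS = {
    "Contingency.Cause.Result": "states a result of the first sentence",
    "Comparison.Contrast": "contrasts with the first sentence",
    "Expansion.Conjunction": "adds related information to the first sentence",
}


@dataclass
class SyntheticExample:
    arg1: str   # unlabeled target-domain text span
    arg2: str   # LLM-generated continuation
    label: str  # relation label the continuation was conditioned on


def build_prompt(arg1: str, label: str) -> str:
    """Build a prompt asking for an Arg2 that realizes `label` implicitly."""
    hint = RELATION_HINTS[label]
    return (
        f"First sentence: {arg1}\n"
        f"Write one following sentence that {hint}. "
        "Do not use an explicit connective.\n"
        "Second sentence:"
    )


def make_examples(
    arg1_pool: Iterable[str],
    labels: Iterable[str],
    generate: Callable[[str], str],
) -> List[SyntheticExample]:
    """Pair every Arg1 with one generated Arg2 per relation label."""
    labels = list(labels)
    examples = []
    for arg1 in arg1_pool:
        for label in labels:
            arg2 = generate(build_prompt(arg1, label)).strip()
            if arg2:  # placeholder screening: drop empty generations
                examples.append(SyntheticExample(arg1, arg2, label))
    return examples
```

The resulting (Arg1, Arg2, label) triples would then serve as additional training data for adapting the source-domain model.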
Abstract
Implicit discourse relation recognition (IDRR) -- the task of identifying the implicit coherence relation between two text spans -- requires deep semantic understanding. Recent studies have shown that zero- or few-shot approaches significantly lag behind supervised models, but LLMs may be useful for synthetic data augmentation, where LLMs generate a second argument following a specified coherence relation. We applied this approach in a cross-domain setting, generating discourse continuations using unlabelled target-domain data to adapt a base model which was trained on source-domain labelled data. Evaluations conducted on a large-scale test set revealed that different variations of the approach did not result in any significant improvements. We conclude that LLMs often fail to generate useful samples for IDRR, and emphasize the importance of considering both statistical significance and comparability when evaluating IDRR models.
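As an illustration of the kind of significance check the abstract calls for, the sketch below runs a paired bootstrap test over per-item correctness of two systems on the same test set. This is a generic technique and an assumption on our part, not necessarily the test used in the paper; it compares accuracy for simplicity, and the function name and parameters are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap(
    gold: Sequence[str],
    pred_a: Sequence[str],
    pred_b: Sequence[str],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of bootstrap resamples in which system B does NOT beat
    system A in accuracy. Small values suggest B's gain is unlikely to be
    an artifact of the particular test sample."""
    if not (len(gold) == len(pred_a) == len(pred_b)) or len(gold) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(gold)
    # Per-item correctness difference: +1 if only B is right, -1 if only A is.
    diffs = [
        int(b == g) - int(a == g) for g, a, b in zip(gold, pred_a, pred_b)
    ]
    not_better = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping A/B pairs together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples
```

Here `pred_a` could hold the baseline model's predictions and `pred_b` those of a model adapted with synthetic data. Resampling the same items for both systems keeps the comparison paired, which matters when gains are small and variance across runs is high.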
