VaeDiff-DocRE: End-to-end Data Augmentation Framework for Document-level Relation Extraction
Khai Phan Tran, Wen Hua, Xue Li
TL;DR
This paper tackles the long-tail problem in document-level relation extraction by proposing VaeDiff-DocRE, an embedding-space data augmentation framework. It combines an Entity Pair VAE (EP-VAE) with a diffusion-based prior to model relation-wise distributions in the latent space, enabling targeted generation of representations for minority relations. A three-stage hierarchical training pipeline learns the relation distributions, trains the augmentation module, and integrates the augmented data into the DocRE model, improving F1 scores on Re-DocRED and DWIE, especially for rare relations. The approach outperforms state-of-the-art baselines and introduces diffusion priors as a tool for multi-label DocRE augmentation, with practical implications for handling imbalanced real-world data in knowledge extraction tasks.
Abstract
Document-level Relation Extraction (DocRE) aims to identify relationships between entity pairs within a document. However, most existing methods assume a uniform label distribution, resulting in suboptimal performance on real-world, imbalanced datasets. To tackle this challenge, we propose a novel data augmentation approach using generative models to enhance data from the embedding space. Our method leverages the Variational Autoencoder (VAE) architecture to capture all relation-wise distributions formed by entity pair representations and augment data for underrepresented relations. To better capture the multi-label nature of DocRE, we parameterize the VAE's latent space with a Diffusion Model. Additionally, we introduce a hierarchical training framework to integrate the proposed VAE-based augmentation module into DocRE systems. Experiments on two benchmark datasets demonstrate that our method outperforms state-of-the-art models, effectively addressing the long-tail distribution problem in DocRE.
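To make the idea of embedding-space augmentation for rare relations concrete, here is a minimal sketch using only the Python standard library. It is not the paper's method: a diagonal Gaussian fitted per relation stands in for the EP-VAE with its diffusion prior, and the function names (`fit_relation_gaussians`, `augment_minority`), the relation names, and the toy 2-dimensional embeddings are all hypothetical, chosen for illustration only.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def fit_relation_gaussians(pairs):
    """Fit a diagonal Gaussian per relation over entity-pair embeddings.

    `pairs` is a list of (embedding, relation_labels) tuples. An entity pair
    may carry several labels (multi-label), so it contributes to every
    relation it is tagged with.
    """
    by_relation = defaultdict(list)
    for embedding, labels in pairs:
        for label in labels:
            by_relation[label].append(embedding)

    stats = {}
    for label, vectors in by_relation.items():
        dims = list(zip(*vectors))  # transpose: one tuple per dimension
        stats[label] = {
            "count": len(vectors),
            "mean": [mean(d) for d in dims],
            "std": [pstdev(d) for d in dims],
        }
    return stats


def augment_minority(stats, target_count, rng):
    """Sample synthetic embeddings until each relation reaches `target_count`."""
    synthetic = []
    for label, s in stats.items():
        # Relations already at or above the target receive no synthetic samples.
        for _ in range(max(0, target_count - s["count"])):
            vector = [rng.gauss(m, sd) for m, sd in zip(s["mean"], s["std"])]
            synthetic.append((vector, [label]))
    return synthetic


# Toy data (made up): "capital_of" is frequent, "sibling" is rare.
rng = random.Random(0)
pairs = [([rng.gauss(1.0, 0.1), rng.gauss(0.0, 0.1)], ["capital_of"]) for _ in range(20)]
pairs += [([rng.gauss(-1.0, 0.1), rng.gauss(2.0, 0.1)], ["sibling"]) for _ in range(3)]

stats = fit_relation_gaussians(pairs)
synthetic = augment_minority(stats, target_count=20, rng=rng)
```

In this sketch every synthetic sample carries a single relation label, so it ignores the label co-occurrence that the abstract identifies as the reason for parameterizing the latent space with a diffusion model. A learned generative model would replace the per-relation Gaussian in a faithful implementation.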
