VaeDiff-DocRE: End-to-end Data Augmentation Framework for Document-level Relation Extraction

Khai Phan Tran, Wen Hua, Xue Li

TL;DR

This paper tackles the long-tail problem in document-level relation extraction by proposing VaeDiff-DocRE, an embedding-space data augmentation framework. It combines an Entity Pair VAE (EP-VAE) with a diffusion-based prior to model relation-wise distributions in the latent space, enabling targeted generation of representations for minority relations. A three-stage hierarchical training pipeline learns relation distributions, trains the augmentation module, and integrates augmented data into the DocRE model, yielding improved F1 scores on Re-DocRED and DWIE, especially for rare relations. The approach reports gains over state-of-the-art baselines and introduces diffusion priors as a novel tool for multi-label DocRE augmentation, with practical implications for handling real-world imbalanced data in knowledge extraction tasks.

Abstract

Document-level Relation Extraction (DocRE) aims to identify relationships between entity pairs within a document. However, most existing methods assume a uniform label distribution, resulting in suboptimal performance on real-world, imbalanced datasets. To tackle this challenge, we propose a novel data augmentation approach using generative models to enhance data from the embedding space. Our method leverages the Variational Autoencoder (VAE) architecture to capture all relation-wise distributions formed by entity pair representations and augment data for underrepresented relations. To better capture the multi-label nature of DocRE, we parameterize the VAE's latent space with a Diffusion Model. Additionally, we introduce a hierarchical training framework to integrate the proposed VAE-based augmentation module into DocRE systems. Experiments on two benchmark datasets demonstrate that our method outperforms state-of-the-art models, effectively addressing the long-tail distribution problem in DocRE.
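
The core idea of augmenting underrepresented relations in the embedding space can be illustrated with a toy sketch. This is not the paper's EP-VAE or its diffusion prior: a per-relation diagonal Gaussian stands in for the learned latent distribution, and the names `fit_diag_gaussian`, `sample`, `augment_minority`, and the toy relation labels are hypothetical. It only shows the general pattern of fitting a relation-wise distribution over entity pair representations and sampling extra representations for rare relations.

```python
import random
from statistics import mean, pstdev


def fit_diag_gaussian(vectors):
    """Per-dimension mean and standard deviation of equal-length vectors."""
    dims = list(zip(*vectors))
    mu = [mean(d) for d in dims]
    sigma = [pstdev(d) for d in dims]
    return mu, sigma


def sample(mu, sigma, n, rng):
    """Draw n vectors as z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        for _ in range(n)
    ]


def augment_minority(reps_by_relation, target_count, rng):
    """Top up each relation that has fewer than target_count representations.

    Assumption: a diagonal Gaussian per relation replaces the learned
    VAE/diffusion generator described in the abstract.
    """
    augmented = {}
    for rel, reps in reps_by_relation.items():
        extra = max(0, target_count - len(reps))
        mu, sigma = fit_diag_gaussian(reps)
        augmented[rel] = reps + sample(mu, sigma, extra, rng)
    return augmented


rng = random.Random(0)
toy = {
    # Hypothetical 2-dimensional entity pair representations.
    "frequent_relation": [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(8)],
    "rare_relation": [[5.0, 5.1], [5.2, 4.9], [4.8, 5.0]],
}
balanced = augment_minority(toy, target_count=8, rng=rng)
```

In this sketch the rare relation is padded with five sampled vectors while the frequent relation is left unchanged; the original representations are kept at the front of each list so real and generated data stay distinguishable.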

Paper Structure

This paper contains 43 sections, 31 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Relation frequency in Re-DocRED (Tan et al., 2022) and DWIE (Zaporojets et al., 2021) datasets.
  • Figure 2: Relation-wise distribution from Re-DocRED dataset visualized by t-SNE (van der Maaten and Hinton, 2008).
  • Figure 3: Structure of EP-VAE module.
  • Figure 4: Overview of our VaeDiff-DocRE framework.
  • Figure 5: t-SNE visualization of relation-wise distributions of encoded and generated Entity Pair Representations by the trained VaeDiff module. Each distribution is depicted in a unique color. $\times$ denotes the representations generated by the module; the other marker represents the encoded actual entity pair representations within the document.