Graph-Conditional Flow Matching for Relational Data Generation

Davide Scassola, Sebastiano Saccani, Luca Bortolussi

TL;DR

The paper tackles privacy-preserving relational data synthesis by introducing a graph-conditioned flow matching framework. It models the entire relational dataset as a graph and learns the conditional distribution $p(X \mid G)$ of the record contents $X$ given the foreign-key graph $G$. The denoiser integrates a graph neural network, which lets it capture long-range dependencies across records and generate all tables in parallel. The approach achieves state-of-the-art fidelity on multiple relational benchmarks and demonstrates negligible privacy leakage, underscoring its practical viability for safe data sharing. The method is modular, scalable, and adaptable to complex schemas, with potential for extension to larger graphs and diffusion-inspired variants.

Abstract

Data synthesis is gaining momentum as a privacy-enhancing technology. While single-table tabular data generation has seen considerable progress, current methods for multi-table data often lack the flexibility and expressiveness needed to capture complex relational structures. In particular, they struggle with long-range dependencies and complex foreign-key relationships, such as tables with multiple parent tables or multiple types of links between the same pair of tables. We propose a generative model for relational data that generates the content of a relational dataset given the graph formed by the foreign-key relationships. We do this by learning a deep generative model of the content of the whole relational database by flow matching, where the neural network trained to denoise records leverages a graph neural network to obtain information from connected records. Our method is flexible, as it can support relational datasets with complex structures, and expressive, as the generation of each record can be influenced by any other record within the same connected component. We evaluate our method on several benchmark datasets and show that it achieves state-of-the-art performance in terms of synthetic data fidelity.

Paper Structure

This paper contains 43 sections, 24 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of the architecture of the denoiser for relational data. A relational dataset composed of multiple tables can be seen as a graph, where records are the nodes and foreign keys are the edges. The denoiser takes as input a relational dataset in which noise has been added to each record at noise level $t$. First, a graph neural network (GNN) processes the entire graph and computes node embeddings $\varepsilon^i_t$ that encode context information for each record. Each record and its corresponding embedding are then processed independently by table-specific multi-layer perceptrons (MLPs), which predict the original clean records ($t=1$).
  • Figure 2: Best validation loss as a function of the GNN embedding size. A value of zero means the GNN is not used.
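
The denoiser described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. The weights are random and untrained, a single mean-aggregation round stands in for the GNN, one linear layer stands in for each table-specific MLP, and the noising path $x_t = t\,x_1 + (1-t)\,x_0$ is an assumed (commonly used) linear interpolation. The function names `with_time`, `init_params`, and `denoise`, and the toy `customers` and `orders` tables, are hypothetical.

```python
import numpy as np


def with_time(x, t):
    """Append the noise level t as an extra column to every record."""
    return np.hstack([x, np.full((x.shape[0], 1), t)])


def init_params(table_dims, hidden, emb, rng):
    """Random (untrained) weights: per-table encoders and heads, one shared GNN layer."""
    params = {"gnn": 0.3 * rng.standard_normal((2 * hidden, emb)), "enc": {}, "head": {}}
    for name, d in table_dims.items():
        params["enc"][name] = 0.3 * rng.standard_normal((d + 1, hidden))
        params["head"][name] = 0.3 * rng.standard_normal((d + 1 + emb, d))
    return params


def denoise(noisy_tables, edges, t, params):
    """Estimate clean records from noisy ones, conditioned on the foreign-key graph."""
    names = list(noisy_tables)
    # 1) Project every record into a shared hidden space (tables differ in width).
    h = np.concatenate(
        [np.tanh(with_time(noisy_tables[n], t) @ params["enc"][n]) for n in names]
    )
    # 2) One round of mean message passing over undirected foreign-key edges
    #    (a stand-in for the GNN; edges use global node ids).
    agg = np.zeros_like(h)
    deg = np.zeros(len(h))
    for u, v in edges:
        agg[u] += h[v]
        agg[v] += h[u]
        deg[u] += 1
        deg[v] += 1
    agg /= np.maximum(deg, 1.0)[:, None]
    node_emb = np.tanh(np.hstack([h, agg]) @ params["gnn"])
    # 3) Table-specific heads (one linear layer here, standing in for an MLP)
    #    map (noisy record, t, node embedding) to a clean-record estimate.
    out, start = {}, 0
    for n in names:
        x = noisy_tables[n]
        e = node_emb[start:start + len(x)]
        out[n] = np.hstack([with_time(x, t), e]) @ params["head"][n]
        start += len(x)
    return out


rng = np.random.default_rng(0)
clean = {"customers": rng.standard_normal((3, 2)), "orders": rng.standard_normal((4, 3))}
# Global node ids: customers are 0..2, orders are 3..6.
edges = [(0, 3), (0, 4), (1, 5), (2, 6)]
t = 0.5
# Assumed linear noising path: pure noise at t=0, clean data at t=1.
noisy = {n: t * x + (1 - t) * rng.standard_normal(x.shape) for n, x in clean.items()}
params = init_params({"customers": 2, "orders": 3}, hidden=8, emb=4, rng=rng)
pred = denoise(noisy, edges, t, params)
```

With a single message-passing round, a record is influenced only by its direct foreign-key neighbours. Stacking further rounds would extend that reach to more distant records in the same connected component, which is the property the abstract highlights.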