DiSK: A Diffusion Model for Structured Knowledge

Ouail Kitouni; Niklas Nolte; James Hensman; Bhaskar Mitra

TL;DR

DiSK proposes a diffusion-based framework for structured knowledge that operates on heterogeneous data types (text, categorical, numerical) and uses hierarchical encodings and per-type decoders to model inter-property relations. It introduces a continuous-time discrete diffusion with absorbing states and a Gaussian Mixture Model for numerics, enabling high-precision predictions and robust imputation for sparse data. Across more than 15 tabular datasets and specialized domains like nuclear physics, DiSK demonstrates state-of-the-art or competitive performance in data modeling, synthesis, and downstream predictive tasks, while also offering interpretable, human-curatable knowledge representations. The work highlights a pathway to integrating structured knowledge manipulation with diffusion-based generative modeling, with potential extensions to foundation models and knowledge graphs.
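
To make the absorbing-state idea concrete, here is a minimal sketch of a forward (noising) step in which each property of an entity is independently replaced by a mask. It is an illustration under stated assumptions, not the paper's implementation: the `MASK` sentinel, the `forward_mask` helper, the toy entity, and the linear schedule (mask probability equal to `t`) are all hypothetical.

```python
import random

MASK = None  # hypothetical sentinel for the absorbing "masked" state


def forward_mask(entity, t, rng):
    """Mask each property independently with probability t (0 <= t <= 1).

    Assumption: a linear schedule where the mask probability equals t.
    Once a value is masked it carries no information, which is what makes
    the masked state absorbing.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return {key: (MASK if rng.random() < t else value)
            for key, value in entity.items()}


# Toy entity mixing the three value types mentioned above (made-up values).
entity = {"name": "example", "category": "A", "value": 1.5}
rng = random.Random(0)

clean = forward_mask(entity, 0.0, rng)     # t = 0: nothing is masked
partial = forward_mask(entity, 0.5, rng)   # t = 0.5: a random subset is masked
noised = forward_mask(entity, 1.0, rng)    # t = 1: every property is masked
```

At `t = 1` the entity reaches the fully masked state, which is the starting point of the reverse process described in Proposition 3.1.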

Abstract

Structured (dictionary-like) data presents challenges for left-to-right language models, as they can struggle with structured entities for a wide variety of reasons such as formatting and sensitivity to the order in which attributes are presented. Tabular generative models suffer from a different set of limitations such as their lack of flexibility. We introduce Diffusion Models of Structured Knowledge (DiSK), a new architecture and training approach specialized for structured data. DiSK handles text, categorical, and continuous numerical data using a Gaussian mixture model approach, which allows for improved precision when dealing with numbers. It employs diffusion training to model relationships between properties. Experiments demonstrate DiSK's state-of-the-art performance on tabular data modeling, synthesis, and imputation on over 15 datasets across diverse domains. DiSK provides an effective inductive bias for generative modeling and manipulation of structured data. The techniques we propose could open the door to improved knowledge manipulation in future language models.
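
As a rough illustration of a Gaussian mixture likelihood for a numerical property, the sketch below computes the negative log-likelihood of a scalar under a one-dimensional mixture. The function names and the example numbers are assumptions made for illustration; the source does not specify how DiSK parameterizes or trains its mixture. The sketch also checks a standard fact: with a single component and unit variance, the loss equals half the squared error plus a constant.

```python
import math


def gmm_nll(x, weights, means, stds):
    """Negative log-likelihood of scalar x under a 1-D Gaussian mixture.

    Assumes weights are positive and sum to one, and stds are positive.
    """
    log_terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2 * math.pi)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    # Log-sum-exp over components for numerical stability.
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(v - top) for v in log_terms)))


def half_squared_error(x, mean):
    """Half the squared error between a target x and a prediction."""
    return 0.5 * (x - mean) ** 2


# One component with unit variance: NLL = 0.5 * (x - mean)^2 + 0.5 * log(2*pi).
const = 0.5 * math.log(2 * math.pi)
single = gmm_nll(1.7, [1.0], [0.2], [1.0])

# Two components: a bimodal target is scored well near either mode.
near_mode = gmm_nll(-2.0, [0.5, 0.5], [-2.0, 2.0], [0.1, 0.1])
between_modes = gmm_nll(0.0, [0.5, 0.5], [-2.0, 2.0], [0.1, 0.1])
```

A single Gaussian would have to place its mean between the two modes, which is one intuition for why a mixture can give more precise numerical predictions.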

Paper Structure (41 sections, 1 theorem, 16 equations, 12 figures, 9 tables)

Introduction
Related work
Generative Modeling of Structured Entities
Masked Modeling
A Formulation of Diffusion over Heterogeneous Data
Forward Process
A simple simulation of the backward diffusion process
Likelihood bound
A Continuous Relaxation of Discrete State Diffusion
DiSK Architecture
Hierarchical positional encoding
Encoding
Entity encoding and decoding to property values
Revisiting Numerical Properties
Experiments
...and 26 more sections

Key Result

Proposition 3.1

For the reverse diffusion from the fully masked stationary distribution towards $p(\mathbf{x}_0)$, an upper bound on the model negative log-likelihood $\mathbb{E}_{p(x)}[-\log p_0^\theta(x)]$ can be given by [...], where $\psi(\tilde{\mathbf{x}}) = \sum_{\mathbf{x}} q_t(\mathbf{x})\, r_t(\tilde{\mathbf{x}} \mid \mathbf{x})$ and $C(\pi) = D \frac{1-\hat{\pi}}{1-\pi} \frac{1}{N_t+1}$.

Figures (12)

Figure 1: The next-token prediction objective retrieves information in only one direction, whereas DiSK models structured data with order-invariant training (see the section on diffusion over heterogeneous data).
Figure 2: Hierarchical representations of entities can model rich relationships that can be difficult to capture with dense tabular representations, which can be prohibitive for sparse KBs. Black squares correspond to non-existing values, making the table sparse.
Figure 3: Generating samples with keys "property $j$" using masked modeling in one step (left) and autoregressively (right), in which case property values are unmasked in random order.
Figure 4: The DiSK architecture. Keys from the input entity are used by an RNN to generate hierarchical encodings (left). These are added to the encoded values (right), and the result is processed by an encoder. Type-specific decoders then output logits and GMM parameters. Dashed lines are not computed, i.e., masked values are not encoded and unmasked values are not predicted.
Figure 5: Generated samples using a DiSK with GMM likelihood. The GMM uses 256 components (left) or 1 component (middle) which is equivalent to MSE when we fix the variance to unity. (right) A histogram of the data with the DiSK learned marginals.
...and 7 more figures
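
The two sampling modes contrasted in Figure 3 can be sketched as follows. This is a toy illustration under stated assumptions: `predict` is a hypothetical stand-in for a trained model, `toy_predict` is a made-up rule that only reports how many properties were visible when a value was filled in, and `MASK` is a hypothetical sentinel.

```python
import random

MASK = None  # hypothetical sentinel for a masked property value


def unmask_in_one_step(entity, predict):
    """Fill every masked property at once, all from the same context."""
    return {k: (predict(entity, k) if v is MASK else v)
            for k, v in entity.items()}


def unmask_in_random_order(entity, predict, rng):
    """Fill masked properties one at a time, in a random order.

    Each newly filled value becomes visible context for the next step.
    """
    filled = dict(entity)  # copy so the input entity is left untouched
    masked_keys = [k for k, v in filled.items() if v is MASK]
    rng.shuffle(masked_keys)
    for key in masked_keys:
        filled[key] = predict(filled, key)
    return filled


def toy_predict(entity, key):
    # Toy rule for illustration only: count the properties visible so far.
    visible = sum(v is not MASK for v in entity.values())
    return f"{key}-given-{visible}"


start = {"property 1": "a", "property 2": MASK, "property 3": MASK}
one_step = unmask_in_one_step(start, toy_predict)
sequential = unmask_in_random_order(start, toy_predict, random.Random(0))
```

In the one-step mode both masked values are predicted from a single visible property, while in the sequential mode the second value to be filled also conditions on the first.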

Theorems & Definitions (1)

Proposition 3.1
