Graph Structure Inference with BAM: Introducing the Bilinear Attention Mechanism

Philipp Froehlich, Heinz Koeppl

TL;DR

Graph structure inference from observational data is framed as a supervised learning problem using covariance information processed on the SPD manifold. The Bilinear Attention Mechanism (BAM) combines channel embeddings, observational self-attention, SPD-based bilinear attention, and a Log-Eig mapping, with training data generated from SEMs using random Chebyshev polynomials to cover diverse dependencies. BAM delivers robust undirected graph recovery and competitive CPDAG estimation by first identifying skeleton and moralized edges and then orienting edges via a dedicated CPDAG network informed by Meek rules. The approach demonstrates strong generalization to nonlinear dependencies, offers computational efficiency relative to unsupervised methods, and opens new avenues for SPD-manifold optimization in graph learning.
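The two edge types named above can be made concrete with a small sketch. It assumes that a "skeleton edge" joins two variables that are adjacent in the ground-truth DAG and that a "moralized edge" joins two non-adjacent parents of a common child; the function name `edge_classes` and the integer encoding 0/1/2 are illustrative choices, not taken from the paper.

```python
import numpy as np

def edge_classes(A):
    """Label every variable pair of a DAG.

    A[i, j] = 1 means i -> j (assumed convention).
    Returns a symmetric (d, d) integer matrix with
    0 = no edge, 1 = skeleton edge, 2 = moralized edge.
    """
    A = np.asarray(A, dtype=int)
    # Skeleton: drop edge directions.
    skeleton = (A + A.T) > 0
    # (A @ A.T)[i, k] counts the common children of i and k.
    common_child = (A @ A.T) > 0
    np.fill_diagonal(common_child, False)
    # Moralized edges connect co-parents that are not already adjacent.
    moral_only = common_child & ~skeleton
    labels = np.zeros_like(A)
    labels[skeleton] = 1
    labels[moral_only] = 2
    return labels

# v-structure 0 -> 2 <- 1: the parents 0 and 1 become a moralized pair.
v_structure = np.array([[0, 0, 1],
                        [0, 0, 1],
                        [0, 0, 0]])
labels = edge_classes(v_structure)
```

Under these assumptions, a moralized pair whose two members are not adjacent in the skeleton marks a v-structure, which is the information an orientation step can build on.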

Abstract

In statistics and machine learning, detecting dependencies in datasets is a central challenge. We propose a novel neural network model for supervised graph structure learning, i.e., the process of learning a mapping between observational data and their underlying dependence structure. The model is trained with variably shaped and coupled simulated input data and requires only a single forward pass through the trained network for inference. By leveraging structural equation models and employing randomly generated multivariate Chebyshev polynomials for the simulation of training data, our method demonstrates robust generalizability across both linear and various types of non-linear dependencies. We introduce a novel bilinear attention mechanism (BAM) for explicit processing of dependency information, which operates on the level of covariance matrices of transformed data and respects the geometry of the manifold of symmetric positive definite matrices. Empirical evaluation demonstrates the robustness of our method in detecting a wide range of dependencies, excelling in undirected graph estimation and proving competitive in completed partially directed acyclic graph estimation through a novel two-step approach.
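As a rough illustration of the training-data simulation described in the abstract, the sketch below samples a random DAG and draws observations from an additive structural equation model whose edge functions are random Chebyshev polynomials. It is a simplification under stated assumptions: the paper uses multivariate Chebyshev polynomials, whereas this sketch uses one univariate polynomial per edge; the function name `sample_sem_data`, the `tanh` squashing, the Gaussian noise, and all default parameters are hypothetical choices.

```python
import numpy as np
from numpy.polynomial import chebyshev

def sample_sem_data(d=5, M=200, edge_prob=0.3, degree=3, seed=0):
    """Draw M observations of d variables from a random additive SEM.

    Returns (X, A) with X of shape (M, d) and A[i, j] = 1 meaning i -> j.
    """
    rng = np.random.default_rng(seed)
    # A strictly upper-triangular adjacency matrix is acyclic by construction.
    A = np.triu(rng.random((d, d)) < edge_prob, k=1).astype(int)
    X = np.zeros((M, d))
    for j in range(d):  # 0, 1, ..., d-1 is a topological order
        x = rng.normal(size=M)  # additive noise term (assumed Gaussian)
        for i in np.flatnonzero(A[:, j]):
            coeffs = rng.normal(size=degree + 1)  # random Chebyshev coefficients
            # Squash the parent into [-1, 1], the natural domain of the basis.
            x += chebyshev.chebval(np.tanh(X[:, i]), coeffs)
        X[:, j] = x
    return X, A

X, A = sample_sem_data(d=6, M=50, seed=1)
```

Varying `d`, `M`, and the seed across calls yields differently shaped inputs with known ground-truth graphs, which is the kind of labeled pair a supervised structure learner needs.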

Paper Structure (49 sections, 3 theorems, 30 equations, 11 figures, 3 tables)

Key Result

Theorem 1

For any $\boldsymbol{S}\in\mathcal{S}^{d\times d}_{\succeq}$, the largest eigenvalue of $\widetilde{\boldsymbol{\sigma}}(\boldsymbol{S})$ is $1$.

Figures (11)

  • Figure 1: Neural network architecture: An input of arbitrary shape $(M,d)$ is provided, which is then embedded into $C$ channels. Attention between attributes and attention between datapoints are applied alternately. Covariance matrices are calculated, followed by alternating applications of bilinear attention and the custom activation function in the Riemannian manifold of SPD matrices. The matrices are then transformed into Euclidean space using the $\operatorname{Log-Eig}$ layer. Output probabilities for each pair of variables being in the classes "no edge", "skeleton edge", and "moralized edge" are calculated using dense layers along the channel axis and applying a softmax layer on the channel axis.
  • Figure 2: Scatterplots illustrating example non-linear dependencies governed by the structural equation model employed in this study.
  • Figure 3: Bilinear self-attention layer. Gray indicates non-trainable tensors, and red trainable weights. Matrix multiplication is performed after necessary transposition to match axis dimensions. The double arrow signifies the use of the matrix as a bilinear operator. $\widetilde{\sigma}$ denotes the custom softmax, defined in \ref{spdmax}.
  • Figure 4: Undirected graph estimation results, arranged in each case from worst (left) to best (right). (a) and (b) show AUC values for different dependencies, with (a) $d=50$, $M=200$ and (b) $d=100$, $M=50$. (c) shows accuracy values for the same dependencies at $d=100$, $M=50$. (d) and (e) present AUC values for different sample sizes with Chebyshev (d) and cosine (e) dependencies at $d=100$. (f) displays the structural Hamming distance for varying sample sizes in a high-dimensional setting ($d=100$) for the Chebyshev dependency.
  • Figure 5: CPDAG estimation results, ordered from worst (left) to best (right). (a)-(c) SHD for various dependencies at (a) $d=20$, $M=200$, (b) $d=50$, $M=200$, and (c) $d=100$, $M=500$. (d) AUC for the Chebyshev dependency across $M=50, 100, 200, 500, 1000$ at $d=100$. (e) and (f) SHD at $d=100$ for Chebyshev and sine dependencies, respectively, over the same $M$ values.
  • ...and 6 more figures
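
The covariance and $\operatorname{Log-Eig}$ steps mentioned in the Figure 1 caption can be sketched as follows. This assumes the common definition of Log-Eig as the matrix logarithm computed through an eigendecomposition, $\boldsymbol{S} = \boldsymbol{U}\operatorname{diag}(\lambda)\boldsymbol{U}^\top \mapsto \boldsymbol{U}\operatorname{diag}(\log\lambda)\boldsymbol{U}^\top$; the function name `log_eig`, the eigenvalue floor `eps`, and the toy data are illustrative and not taken from the paper.

```python
import numpy as np

def log_eig(S, eps=1e-6):
    """Map an SPD matrix to a symmetric matrix in Euclidean space."""
    S = 0.5 * (S + S.T)          # remove numerical asymmetry
    w, U = np.linalg.eigh(S)     # eigenvalues ascending, orthonormal eigenvectors
    w = np.clip(w, eps, None)    # assumed floor so the logarithm stays finite
    return (U * np.log(w)) @ U.T # U diag(log w) U^T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))    # M = 200 observations, d = 4 variables
S = np.cov(X, rowvar=False)      # (d, d) sample covariance, SPD for M > d
L = log_eig(S)
```

The output `L` is symmetric but no longer constrained to be positive definite, so ordinary dense layers and a softmax can be applied to it, which matches the role the caption assigns to the layer.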

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Proposition 2
  • Theorem 3