A Novel Graph Transformer Framework for Gene Regulatory Network Inference

Binon Teji, Swarup Roy

TL;DR

This work tackles gene regulatory network (GRN) inference from noisy gene-expression data by formulating it as a link-prediction problem. It introduces GT-GRN, a Graph Transformer framework that fuses three information streams: gene-expression embeddings learned via a Variational Autoencoder, global gene embeddings derived from multi-network prior knowledge encoded as text-like sequences processed by a BERT-based model, and graph positional encodings from the input network. The approach demonstrates superior performance on both full network reconstruction and link-prediction tasks across multiple datasets, and shows utility for cell-type annotation through learned gene embeddings. The results underscore the value of multi-modal integration and global context in GRN inference, with potential extensions toward prioritizing disease-relevant genes and regulatory hubs.

Abstract

The inference of gene regulatory networks (GRNs) is a foundational step towards deciphering complex biological systems. Inferring a possible regulatory link between two genes can be formulated as a link prediction problem. GRNs inferred from gene coexpression profiling data may not always reflect true biological interactions, because such data are susceptible to noise and can misrepresent true regulatory relationships. Most GRN inference methods therefore face several challenges in the network reconstruction phase. To obtain a better and more confident GRN reconstruction, it is important to encode gene expression values, leverage the prior knowledge available in already inferred network structures, and use the positional information of the input network nodes. In this paper, we explore the integration of multiple inferred networks to enhance GRN inference. First, we employ autoencoder embeddings to capture gene expression patterns directly from raw data, preserving intricate biological signals. Then, we embed the prior knowledge from GRN structures by transforming them into a text-like representation using random walks; these sequences are encoded with a masked language model, BERT, to generate a global embedding for each gene across all networks. Additionally, we embed the positional encodings of the input gene networks to better identify the position of each unique gene within the graph. These embeddings are integrated into a graph transformer-based model, termed GT-GRN, for GRN inference. The GT-GRN model effectively utilizes the topological structure of the ground truth network while incorporating the enriched encoded information. Experimental results demonstrate that GT-GRN significantly outperforms existing GRN inference methods, achieving superior accuracy and highlighting the robustness of our approach.
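
The random-walk step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy networks, the gene names, the walk length, the number of walks per node, and the 15% masking rate are all assumptions chosen for illustration, and `random_walk_sentences` and `mask_tokens` are hypothetical helper names. The sketch only shows how several inferred networks might be turned into `[CLS]`-prefixed token sequences and masked for a BERT-style objective; the language model itself is omitted.

```python
import random


def random_walk_sentences(adjacency, walk_length=5, walks_per_node=2, seed=0):
    """Turn one network into token sequences using uniform random walks.

    adjacency maps each gene to a list of its neighbours (assumed format).
    Every sequence is prefixed with a [CLS] token.
    """
    rng = random.Random(seed)
    sentences = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency.get(walk[-1], [])
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            sentences.append(["[CLS]"] + walk)
    return sentences


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace gene tokens with [MASK] at random; [CLS] is never masked.

    Returns the masked sequence and per-position labels, where a label is
    the original gene at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if tok != "[CLS]" and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Two hypothetical inferred networks over the same three genes.
networks = {
    "net_a": {"g1": ["g2", "g3"], "g2": ["g3"], "g3": ["g1"]},
    "net_b": {"g1": ["g3"], "g2": ["g1"], "g3": ["g2"]},
}

# One shared corpus, so a gene's context spans all networks.
corpus = []
for name, adj in networks.items():
    corpus.extend(random_walk_sentences(adj, seed=0))

masked, labels = mask_tokens(corpus[0])
```

Pooling the walks from every network into a single corpus is what lets one token vocabulary (one entry per gene) pick up context from all networks at once, which is the sense in which the resulting embeddings are "global".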

Paper Structure

This paper contains 26 sections, 14 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Gene Expression Embeddings via Variational Autoencoder (VAE)
  • Figure 2: Global Gene Embeddings via Multi-Network Integration. Gene expression data is processed through inference algorithms to generate networks. Every network is sampled via random walks to produce node sequences, each starting with a special [CLS] token. These sequences are tokenized, embedded, and passed to a transformer model, yielding a global embedding for each node (gene) that reflects the context of all input networks in a joint learning setting. The model is trained by masking nodes in the sequences and predicting them. Final node/gene embeddings are extracted from the embedding layer.
  • Figure 3: Architecture diagram of GT-GRN. The Graph Transformer layer operates on the input graph $\mathcal{A}$ with its corresponding features $h$. The input features consist of gene expression embeddings, graph positional encodings, and global gene embeddings. For a given node $\textbf{a}$, the features pass through $L$ layers, each layer $\ell$ producing the next-layer representation $h_a^{\ell+1}$. Link prediction is done by a separate link predictor module that takes two node embeddings, say $h_a^{\ell+1}$ and $h_b^{\ell+1}$, and predicts whether a link exists between them.
  • Figure 4: Full network reconstruction performance of various methods on different datasets in terms of AUROC score: (a) BEELINE’s scRNA-seq datasets and (b) GNW’s Yeast dataset.
  • Figure 5: Overall hyper-parameter tuning plot for various models.
  • ...and 3 more figures
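
The feature fusion and link-prediction step named in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the GT-GRN architecture: the caption does not say how the three feature streams are combined or how the link predictor scores a pair, so concatenation and a sigmoid over a dot product are stand-ins, the Graph Transformer layers between fusion and prediction are omitted entirely, and `fuse_features` and `link_score` are hypothetical names with made-up toy vectors.

```python
import math


def fuse_features(expr_emb, pos_enc, global_emb):
    """Concatenate the three per-gene feature vectors (assumed fusion rule)."""
    return list(expr_emb) + list(pos_enc) + list(global_emb)


def link_score(h_a, h_b):
    """Sigmoid of the dot product: a stand-in for a learned link predictor."""
    dot = sum(x * y for x, y in zip(h_a, h_b))
    return 1.0 / (1.0 + math.exp(-dot))


# Toy vectors: expression embedding, positional encoding, global embedding.
h_a = fuse_features([0.5, -0.1], [0.2], [0.3, 0.4])
h_b = fuse_features([0.4, 0.0], [0.1], [0.2, 0.5])

# A value in (0, 1), read as the predicted probability of a link a-b.
score = link_score(h_a, h_b)
```

In a trained model, the vectors passed to the predictor would be the final-layer node representations rather than the raw fused inputs used here.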