A mechanistically interpretable neural network for regulatory genomics

Alex M. Tseng; Gokcen Eraslan; Tommaso Biancalani; Gabriele Scalia

TL;DR

This work introduces ARGMINN, a mechanistically interpretable neural network for regulatory genomics that encodes motifs and their grammar directly into its architecture. It comprises a motif-scanner module that learns de novo, non-redundant motifs and a syntax-builder module that uses memory-stream–based attention to model interactions between motif instances, yielding an interpretable readout of motif instances and rules in any sequence. Theoretical results prove ARGMINN’s expressivity for first-order logic–defined motif configurations, while extensive experiments show improved motif discovery and motif-instance calling, robust performance under sequence variation, and the novel ability to perform interpretable sequence design. Together, these contributions offer a scalable, readable alternative to post hoc interpretability methods with potential impact on disease genomics, genome design, and regulatory biology.
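The two modules can be pictured with a small NumPy sketch. It follows the Figure 1 caption only loosely: a single convolutional layer scans for motifs, and one attention step derives a single query from a memory stream and keys/values from the filter activations. The ReLU, the softmax, the residual memory update, and the projection matrices `w_q`, `w_k`, `w_v` are assumptions made for illustration, and the filter regularization is omitted entirely. This is not the authors' implementation.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot array."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

def scan_motifs(x, filters):
    """Motif scanner sketch: one convolutional layer followed by ReLU (assumed).

    x: (L, 4) one-hot sequence; filters: (F, W, 4) motif weights.
    Returns (L - W + 1, F) activations; larger means a stronger match.
    """
    width = filters.shape[1]
    windows = np.stack([x[i:i + width] for i in range(x.shape[0] - width + 1)])
    return np.maximum(np.einsum("pwc,fwc->pf", windows, filters), 0.0)

def attention_step(memory, acts, w_q, w_k, w_v):
    """One syntax-builder step (hypothetical form): a single query from the
    memory stream attends over keys/values built from the activations."""
    query = memory @ w_q                        # (d,)
    keys, values = acts @ w_k, acts @ w_v       # (P, d) and (P, m)
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over positions
    return memory + weights @ values, weights   # assumed residual update

# Toy usage: one hand-set filter for the made-up motif "TGA".
filters = one_hot("TGA")[None]                  # shape (1, 3, 4)
acts = scan_motifs(one_hot("ACTGAC"), filters)  # shape (4, 1)
rng = np.random.default_rng(0)
memory, weights = attention_step(
    np.zeros(2), acts,
    rng.normal(size=(2, 3)), rng.normal(size=(1, 3)), rng.normal(size=(1, 2)),
)
print(acts[:, 0], weights)
```

In the toy call, the only nonzero activation sits at position 2, where the window reads "TGA". The attention weights are what would be read out as per-position evidence in this simplified picture.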

Abstract

Deep neural networks excel at mapping genomic DNA sequences to associated readouts (e.g., protein-DNA binding). Beyond prediction, the goal of these networks is to reveal to scientists the underlying motifs (and their syntax) that drive genome regulation. Traditional methods that extract motifs from convolutional filters suffer from the uninterpretable dispersion of information across filters and layers. Other methods, which rely on importance scores, can be unstable and unreliable. Instead, we designed a novel mechanistically interpretable architecture for regulatory genomics, where motifs and their syntax are directly encoded in, and readable from, the learned weights and activations. We provide theoretical and empirical evidence that our architecture is fully expressive while remaining highly interpretable. Through several experiments, we show that our architecture excels in de novo motif discovery and motif instance calling, is robust to variable sequence contexts, and enables fully interpretable generation of novel functional sequences.

Paper Structure (23 sections, 2 theorems, 18 equations, 16 figures, 7 tables)

Introduction
Related Work
ARGMINN Architecture
Module 1: Motif Scanners
Module 2: Syntax Builder
Experimental Results
Improved motif discovery
Improved motif instance calling and syntax discovery
QTL prioritization
Robustness of ARGMINN
Interpretable design of novel functional sequences
Theoretical results
Experimental follow-up to theoretical results
Discussion
Supplementary Proofs
...and 8 more sections

Key Result

Theorem 1

Every configuration of motifs/subsequences which is definable by a sentence in first-order logic (with positional variables) is recognizable by an ARGMINN classifier.
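To illustrate the kind of configuration the theorem covers, consider the rule "motif A occurs upstream of motif B, with no instance of motif C in between." One way to write this as a first-order sentence over positions is shown below. The predicates $M_A$, $M_B$, $M_C$ (true when the named motif occurs at a position) and the order relation $<$ on positions are notation assumed here for illustration; the paper's own formalism may differ.

```latex
\[
\exists x\, \exists y\; \Bigl( M_A(x) \land M_B(y) \land x < y
  \land \neg \exists z\, \bigl( x < z \land z < y \land M_C(z) \bigr) \Bigr)
\]
```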

Figures (16)

Figure 1: Schematic of the ARGMINN architecture. a) The motif-scanner module produces activations denoting which motifs were found at each position, where activation magnitude reflects match strength. The activations are passed to the syntax builder, which learns higher-order logic between motif instances for the final prediction. b) The motif-scanner module is a single convolutional layer which learns all motifs de novo. Regularization ensures that each filter learns one motif (and vice versa) by penalizing different filters for activating on the same underlying subsequences. c) The syntax builder is a series of uniquely designed attention layers. An explicit memory stream tracks the model's state. Each attention layer derives a single query vector from the memory stream, and key/value vectors from the original activations, to update the memory stream.
Figure 2: Motif discovery. a) Example of SPI1 motifs discovered by ARGMINN, compared to interpreting the first-layer filters of a standard CNN, using ExplaiNN, and by clustering DeepLIFTShap importance scores using MoDISco. Note that MoDISco combines forward and reverse-complement orientations. b) For each dataset, we show whether or not each method successfully recovered each relevant motif (above), and the amount of redundancy as the number of times each motif was discovered (below). c) To quantify accuracy of the discovered motifs, for each relevant motif we show the maximum similarity to motifs discovered by each method. d) For each dataset, we show the number of extraneous motifs (those that do not match any known relevant motif) that each method discovered (left). We show a few examples of such extraneous motifs discovered for the CTCF (HepG2) experimental dataset (right).
Figure 3: Motif instance calling and syntax discovery. a) ARGMINN calls motif instances in any query sequence in a single forward pass. High attention scores in any attention layer trace directly back to original filter activations, which directly map to sequence motifs. b) After training on an experimental dataset of REST binding in HepG2, ARGMINN revealed the unique binding syntax of REST in both the forward and reverse-complement orientations, where the half-sites (left and right) bind either adjacently or around 9 to 14 bp apart. c) We compare motif instances discovered by ARGMINN to the traditional approach of using MoDISco to discover motifs and subsequently scanning for them with FIMO. We rank motif instances by confidence (attention score from ARGMINN, or FIMO hit q-value), and compute the fraction of true instances that are covered in a top-$k$ fashion. We also compare to motif instances discovered by scanning for ARGMINN-discovered motifs with FIMO.
Figure 4: QTL classification. a) On a set of known DNase-sensitivity QTLs (dsQTLs), we evaluated the ability of ARGMINN to prioritize true causal dsQTLs by quantifying the difference in predictions with and without the dsQTL mutation. b) We show two examples of causal dsQTLs, which fall in ARGMINN-discovered motif instances. In both cases, making the dsQTL mutation disrupts the ARGMINN-discovered binding site, leading to a reduced prediction of accessibility.
Figure 5: Interpretable sequence design. a) After training on experimental data (e.g., HepG2 accessibility), ARGMINN reveals the functional motifs and their binding syntax which induce function. To generate a novel functional sequence, we insert motifs into a non-functional background, following the syntactical rules learned by ARGMINN. In this novel sequence, the mechanistic justification is fully controlled and known. b) We interpretably constructed novel HepG2-accessible sequences using ARGMINN and validated their accessibility using Borzoi. We compared the accessibility between generated sequences and natural sequences from the experiment (top), as well as between generated sequences and shuffled backgrounds (bottom).
...and 11 more figures
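Two of the evaluations described in the captions above reduce to short computations: the top-$k$ coverage of true motif instances (Figure 3c) and the difference in predictions with and without a mutation (Figure 4a). The sketch below shows one plausible reading of each. The function names, the instance identifiers, and the toy `predict` function are hypothetical and stand in for a trained model; the paper's exact scoring may differ.

```python
def topk_coverage(calls, truth, k):
    """Fraction of true instances found among the k most confident calls.

    calls: list of (instance_id, confidence), higher confidence is better.
           For FIMO q-values, pass e.g. the negated q-value (assumption).
    truth: set of true instance ids.
    """
    ranked = sorted(calls, key=lambda c: c[1], reverse=True)[:k]
    found = {inst for inst, _ in ranked if inst in truth}
    return len(found) / len(truth) if truth else 0.0

def variant_effect(predict, seq, pos, alt):
    """Prediction for the mutated sequence minus prediction for the reference."""
    mutated = seq[:pos] + alt + seq[pos + 1:]
    return predict(mutated) - predict(seq)

# Toy stand-in for a trained model: counts a made-up motif "TGA".
def predict(seq):
    return float(seq.count("TGA"))

calls = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
print(topk_coverage(calls, {"a", "c"}, k=2))
print(variant_effect(predict, "ACTGAC", pos=3, alt="C"))
```

A negative `variant_effect` means the mutation lowers the prediction, which is the direction described for the two causal dsQTL examples in Figure 4b.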

Theorems & Definitions (2)

Theorem 1
Corollary 1.1
