Semantically Rich Local Dataset Generation for Explainable AI in Genomics

Pedro Barbosa; Rosina Savisaar; Alcides Fonseca

TL;DR

This work addresses the challenge of interpreting deep genomic sequence models by enabling local explanations through semantically diverse neighborhood datasets. It introduces a grammar-guided genetic programming framework that evolves perturbations of an input sequence, constrained by a domain-aware representation, and collects promising perturbations in an archive that forms the final local dataset. In RNA splicing scenarios, the approach significantly outperforms a random perturbation baseline on a short proof-of-concept sequence and scales to longer sequences, where it achieves a roughly 30% improvement over the baseline. The findings highlight the value of incorporating biological constraints and locality-aware mutations to better sample the semantic space, which benefits downstream explainability analyses in genomics.

Abstract

Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms. Therefore, interpreting these models may provide novel insights into the underlying biology, supporting downstream biomedical applications. Because these models are so complex, interpretable surrogate models can only be built for local explanations (e.g., of a single instance). However, accomplishing this requires generating a dataset in the neighborhood of the input, which must maintain syntactic similarity to the original data while introducing semantic variability in the model's predictions. This task is challenging due to the complex sequence-to-function relationship of DNA. We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity. Our custom, domain-guided individual representation effectively constrains syntactic similarity, and we provide two alternative fitness functions that promote diversity with no computational effort. Applied to the RNA splicing domain, our approach quickly achieves good diversity and significantly outperforms a random baseline in exploring the search space, as shown in our proof-of-concept on a short RNA sequence. Furthermore, we assess its generalizability and demonstrate scalability to larger sequences, resulting in a ~30% improvement over the baseline.
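The idea of evolving perturbations and collecting them in an archive can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps grammar-guided Genetic Programming for a plain mutation-only evolutionary loop, uses substitutions as the only perturbation type, and replaces the splicing model with a hypothetical stand-in (`toy_model`, the GC fraction of the sequence). The bin-based archive, the "emptier bins are fitter" fitness, and the `archive_quality` definition (fraction of archive slots filled) are likewise assumptions made for illustration.

```python
import random

BASES = "ACGT"

def toy_model(seq):
    """Hypothetical black box: a score in [0, 1] (here, the GC fraction)."""
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0

def apply_perturbations(seq, perts):
    """Apply {position: base} substitutions to the original sequence."""
    out = list(seq)
    for pos, base in perts.items():
        out[pos] = base
    return "".join(out)

def mutate(perts, seq_len, max_perts, rng):
    """Drop one substitution, or add/overwrite one; never exceed max_perts."""
    perts = dict(perts)
    if perts and (len(perts) >= max_perts or rng.random() < 0.3):
        del perts[rng.choice(sorted(perts))]
    else:
        perts[rng.randrange(seq_len)] = rng.choice(BASES)
    return perts

def bin_index(score, n_bins):
    """Map a score in [0, 1] to one of n_bins equal-width prediction bins."""
    return min(int(score * n_bins), n_bins - 1)

def generate_local_dataset(seq, model, n_bins=10, bin_capacity=5,
                           max_perts=6, pop_size=20, generations=50, seed=0):
    """Evolve perturbations of seq; return archive as {bin: {variant: score}}."""
    rng = random.Random(seed)
    archive = {b: {} for b in range(n_bins)}
    population = [mutate({}, len(seq), max_perts, rng) for _ in range(pop_size)]
    for _ in range(generations):
        scored = []
        for perts in population:
            variant = apply_perturbations(seq, perts)
            score = model(variant)
            b = bin_index(score, n_bins)
            # Archive new variants while their prediction bin still has room.
            if variant != seq and len(archive[b]) < bin_capacity:
                archive[b][variant] = score
            # Assumed fitness: individuals landing in emptier bins are fitter.
            fitness = 1.0 - len(archive[b]) / bin_capacity
            scored.append((fitness, perts))
        # Size-2 tournament selection followed by mutation.
        population = []
        for _ in range(pop_size):
            first, second = rng.choice(scored), rng.choice(scored)
            parent = first if first[0] >= second[0] else second
            population.append(mutate(parent[1], len(seq), max_perts, rng))
    return archive

def archive_quality(archive, bin_capacity):
    """Assumed quality measure: fraction of archive slots that are filled."""
    filled = sum(len(members) for members in archive.values())
    return filled / (len(archive) * bin_capacity)

original = "ACGTTGCAAGGTAAGTCCATGCTTAGCCTAGGCATTACGA"
archive = generate_local_dataset(original, toy_model)
print(archive_quality(archive, bin_capacity=5))
```

Capping the number of substitutions per individual is one simple way to keep variants syntactically close to the original, while binning by predicted score rewards variants that spread across the model's output range. A real splicing model would replace `toy_model`.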

Paper Structure (21 sections, 5 equations, 8 figures, 2 tables)

Introduction
Related work
Proposed approach
Overview
Representation
Archive
Fitness Functions
Genetic Operators
Evaluation methodology
Case Study
Experimental settings
Baseline
Hyperparameter Optimization
Results
Performance comparison
...and 6 more sections

Figures (8)

Figure 1: Summary of the proposed methodology.
Figure 2: The core structure of the grammar used to represent an individual with respect to the original sequence, presented in EBNF. Underlined symbols are terminals or meta-handlers.
Figure 3: Left: Average archive quality throughout the search procedure for four strategies (the proposed approach and RandomSearch, each paired with one of two fitness functions, including BinFiller). We show only the top 5 trials of each strategy, each representing the average of 5 seeds. Right: Distribution of the archive quality of the top trial of each strategy over 30 seeds. Statistical significance was assessed using Welch's t-tests for three pairs of samples: two comparing the means of the proposed approach and the baseline when using the same fitness function, and one comparing the proposed approach with different fitness functions. P-values were adjusted for multiple testing using the Bonferroni correction.
Figure 4: Left: Average edit distance and archive quality for tournament vs. lexicase selection over 30 different runs. The horizontal lines reflect the median archive quality of each selection method. Right: Archive size averaged over 30 different runs.
Figure 5: Impact of the frequency of the custom mutation operator (vs tree-based mutation) in the final archive quality, across 30 seeds.
...and 3 more figures
