GenePheno: Interpretable Gene Knockout-Induced Phenotype Abnormality Prediction from Gene Sequences

Jingquan Yan, Yuwei Miao, Lei Yu, Yuzhi Guo, Xue Xiao, Lin Xu, Junzhou Huang

TL;DR

This work addresses the challenge of predicting knockout-induced phenotypic abnormalities directly from gene sequences, bridging the gap between molecular sequences and organism-level traits. It introduces GenePheno, an interpretable end-to-end framework that combines a sequence encoder with a mechanism-aware GO bottleneck, a contrastive multi-label objective, and an exclusivity regularizer, so that it captures inter-phenotype dependencies while enforcing biological constraints. The model is evaluated on four curated datasets (MPO, HPO, GWAS, CAFA2 wPPI) and achieves state-of-the-art performance in both gene-centric $F_{\max}$ and phenotype-centric AUC, with ablations confirming the contributions of the contrastive loss, GO integration, and exclusivity regularization. Case studies show that the learned bottleneck weights align with known functional mechanisms, providing interpretable insights into gene-to-phenotype formation and enabling scalable analysis of unannotated genes.

Abstract

Exploring how genetic sequences shape phenotypes is a fundamental challenge in biology and a key step toward scalable, hypothesis-driven experimentation. The task is complicated by the large modality gap between sequences and phenotypes, as well as the pleiotropic nature of gene-phenotype relationships. Existing sequence-based efforts focus on the degree to which variants of specific genes alter a limited set of phenotypes, while general methods for predicting gene knockout-induced phenotype abnormalities rely heavily on curated genetic information as input, which limits scalability and generalizability. As a result, the task of broadly predicting the presence of multiple phenotype abnormalities under gene knockout directly from gene sequences remains underexplored. We introduce GenePheno, the first interpretable multi-label prediction framework that predicts knockout-induced phenotypic abnormalities from gene sequences. GenePheno employs a contrastive multi-label learning objective that captures inter-phenotype correlations, complemented by an exclusive regularization that enforces biological consistency. It further incorporates a gene function bottleneck layer, offering human-interpretable concepts that reflect functional mechanisms behind phenotype formation. To support progress in this area, we curate four datasets with canonical gene sequences as input and multi-label phenotypic abnormalities induced by gene knockouts as targets. Across these datasets, GenePheno achieves state-of-the-art gene-centric $F_{\max}$ and phenotype-centric AUC, and case studies demonstrate its ability to reveal gene functional mechanisms.
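The gene-centric $F_{\max}$ reported above can be illustrated with a minimal sketch. It assumes the convention commonly used in CAFA-style evaluations: at each threshold, precision is averaged over genes with at least one prediction, recall is averaged over all annotated genes, and the best resulting F-score is kept. The function name `f_max`, the threshold grid, and the array layout (genes by phenotypes) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def f_max(y_true, y_score, thresholds=None):
    """Gene-centric F_max under an assumed CAFA-style convention.

    y_true  : (n_genes, n_phenotypes) binary label matrix
    y_score : (n_genes, n_phenotypes) predicted scores in [0, 1]
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        # Assumed grid; the paper's exact grid is not stated here.
        thresholds = np.linspace(0.01, 1.0, 100)

    # Only genes with at least one true phenotype contribute.
    has_label = y_true.sum(axis=1) > 0
    y_true, y_score = y_true[has_label], y_score[has_label]
    n_true = y_true.sum(axis=1)

    best = 0.0
    for t in thresholds:
        y_pred = y_score >= t
        tp = (y_pred & y_true).sum(axis=1)
        n_pred = y_pred.sum(axis=1)
        covered = n_pred > 0  # genes with at least one prediction at t
        if not covered.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()
        recall = (tp / n_true).mean()
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
    return best
```

Because the maximum is taken over thresholds, the metric does not depend on choosing a single decision cutoff, which is convenient when label frequencies vary widely across phenotypes.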

Paper Structure

This paper contains 50 sections, 2 theorems, 35 equations, 3 figures, 7 tables.

Key Result

Proposition 1

Let $\mathcal{L}$ be the contrastive MLC loss with our proposed soft exclusive regularization weighted by any $\lambda>0$, namely $\mathcal{L} = \mathcal{L}_{\mathrm{MLC}} + \lambda \mathcal{L}_{\mathrm{ex}}$. For any pair $(i,j)\in\mathcal{E}$ and any input $x$, every first-order stationary point

Figures (3)

  • Figure 1: Overview of key biological modalities linking molecular-level DNA sequences to organism-level phenotypic traits and representative methods modeling different stages. Our method, GenePheno, bridges the modality gap in an end-to-end manner.
  • Figure 2: Overview of our learning framework. The GO function DAG comprises three subgraphs, each with its own root, while the phenotype ontology DAG consists of a single graph with one root. In both structures, node depth $d$ denotes the length of the shortest path to the root, with deeper nodes representing more specific functions or phenotypes. We utilize GO functions at dual granularity: fine-grained inputs ($d>2$) for detailed functional information, and coarse-grained bottleneck supervision ($d=2$) for general mechanisms. Target phenotypes span both general and specific categories.
  • Figure 3: Sample bottleneck weight heatmaps for the MPO, HPO, and GWAS datasets. Darker colors indicate a higher correlation between the phenotype and the GO function.
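The depth convention in the Figure 2 caption (depth as the shortest path from a node to its root, with $d=2$ terms used for bottleneck supervision and $d>2$ terms used as inputs) can be sketched with a breadth-first search. This is a minimal illustration, not the authors' preprocessing code: the `parents` dictionary format and the function names `node_depths` and `split_by_granularity` are hypothetical.

```python
from collections import deque


def node_depths(parents):
    """Shortest-path depth of each term to a root of an ontology DAG.

    parents: dict mapping term -> list of parent terms
             (a root maps to an empty list). Hypothetical input format.
    """
    # Invert the parent map so the DAG can be walked from the roots down.
    children = {}
    for term, term_parents in parents.items():
        children.setdefault(term, [])
        for parent in term_parents:
            children.setdefault(parent, []).append(term)

    roots = [term for term in children if not parents.get(term)]
    depth = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            # BFS reaches each node first along a shortest path.
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def split_by_granularity(depth, bottleneck_depth=2):
    """Split terms into coarse (d == 2) and fine (d > 2) sets."""
    coarse = sorted(t for t, d in depth.items() if d == bottleneck_depth)
    fine = sorted(t for t, d in depth.items() if d > bottleneck_depth)
    return coarse, fine
```

Starting the search from all roots at once handles the three GO subgraphs in a single pass. A term with several parents receives the smallest depth among its paths, which matches the shortest-path definition in the caption.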

Theorems & Definitions (2)

  • Proposition 1: Stationary Analysis of Exclusive Regularization
  • Theorem 1: Generalization and Exclusivity Guarantee