HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling

Max van Spengler, Artem Moskalev, Tommaso Mansi, Mangal Prakash, Rui Liao

TL;DR

Biological sequences exhibit hierarchical structure that Euclidean representations struggle to capture. HyperHELM introduces a hybrid hyperbolic language-modeling framework for mRNA that embeds the codon hierarchy in the Poincaré ball and uses the resulting hyperbolic prototypes to guide masked language model (MLM) predictions. It outperforms Euclidean baselines on 9 of 10 downstream property prediction tasks, with an average improvement of about 10%, generalizes better to long and low-GC-content sequences, and improves antibody region annotation accuracy by 3% over hierarchy-aware Euclidean models. The work shows that hyperbolic geometry provides a principled inductive bias for hierarchical biological data, and that a practical hybrid architecture can exploit this bias without the computational burden of fully hyperbolic networks.

Abstract

Language models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry provides a better alternative for accommodating hierarchical data, it has yet to find its way into language modeling for mRNA sequences. In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop a Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 property prediction tasks, with a 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC-content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.

Paper Structure

This paper contains 26 sections, 13 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: High-level overview of the HyperHELM method for MLM. The method consists of three main components: 1) the language modeling of mRNA, where a sequence transformer is used to obtain token representations, shown on the left; 2) a hyperbolic embedding of the codon hierarchy (large version in Appendix \ref{sec:hierarchy_appendix}), generated to serve as prototypes for guiding the language model during pre-training, shown on the right; and 3) hyperbolic hierarchical prototype learning, where the prototypes are used to predict the true label of masked tokens using either distances (green) or entailment cones (blue), visualized in the center.
  • Figure 2: Hyperbolic prototype learning. The center part presents a Poincaré disk where either distances (green) or entailment cone energies (blue) are used to predict the label of embedded tokens. The left part shows a close-up of a masked token representation with its closest prototype, together with the geodesic between them. The right part takes a closer look at one of the entailment cones, showing the geometric interpretation of equations \ref{eq:psi}, \ref{eq:energy} and \ref{eq:xi}.
  • Figure 3: Relationship between the codon usage metric (ENC) and HyperHELM performance gains. Hyperbolic gains are largest for sequences with higher codon usage bias, as indicated by lower ENC.
  • Figure 4: The codon hierarchy that is used for creating prototypes and structuring the representation space.
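
The distance-based prediction described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the standard geodesic distance on the unit Poincaré ball (curvature -1) and turns negative distances into class probabilities with a softmax. The 2-D prototype coordinates, the `temperature` parameter, and the function names are hypothetical choices made for illustration; they are not taken from the paper, which may use a different curvature, dimensionality, or scoring rule.

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit Poincaré ball
    (curvature -1 is an assumption of this sketch)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    # Both points must lie strictly inside the ball, so the denominator is positive.
    denom = (1.0 - sq_x) * (1.0 - sq_y)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)

def prototype_probs(z, prototypes, temperature=1.0):
    """Softmax over negative distances from embedding z to each prototype."""
    logits = {label: -poincare_distance(z, p) / temperature
              for label, p in prototypes.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - max_logit) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: v / total for label, v in exps.items()}

# Hypothetical 2-D prototypes: two synonymous alanine codons placed close
# together, and the methionine codon placed far away.
prototypes = {
    "GCU": (0.70, 0.10),
    "GCC": (0.70, -0.10),
    "AUG": (-0.60, 0.50),
}

z = (0.65, 0.08)  # hypothetical embedding of a masked token
probs = prototype_probs(z, prototypes)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Because distances in the Poincaré ball grow rapidly near the boundary, prototypes placed close to the boundary stay well separated even in low dimensions, which is the property that makes the ball convenient for embedding tree-like hierarchies such as the codon hierarchy in Figure 4.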