Cross-Granularity Representations for Biological Sequences: Insights from ESM and BiGCARP

Hanlin Xiao, Rainer Breitling, Eriko Takano, Mauricio A. Álvarez

Abstract

Recent advances in general-purpose foundation models have stimulated the development of large biological sequence models. While natural language shows symbolic granularity (characters, words, sentences), biological sequences exhibit hierarchical granularity whose levels (nucleotides, amino acids, protein domains, genes) further encode biologically functional information. In this paper, we investigate the integration of cross-granularity knowledge from models through a case study of BiGCARP, a Pfam domain-level model for biosynthetic gene clusters, and ESM, an amino acid-level protein language model. Using representation analysis tools and a set of probe tasks, we first explain why a straightforward cross-model embedding initialization fails to improve downstream performance in BiGCARP, and show that deeper-layer embeddings capture a more contextual and faithful representation of the model's learned knowledge. Furthermore, we demonstrate that representations at different granularities encode complementary biological knowledge, and that combining them yields measurable performance gains in intermediate-level prediction tasks. Our findings highlight cross-granularity integration as a promising strategy for improving both the performance and interpretability of biological foundation models.

Paper Structure (27 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Multi-granularity nature of biological sequences and its analogy to tokenization in NLP. a) Gene sequences at the nucleotide level can be transcribed and translated into amino acids, which can be further abstracted into Pfam domains by HMM scans. A biosynthetic gene cluster (BGC) can thus be represented as a sequence of Pfam domains, the granularity used in the BiGCARP model. b) Similarly, NLP sequences can be represented at different granularities, from characters through subwords (the "token" here) to words and sentences. Unlike statistical NLP tokenizers (e.g., byte-pair encoding), which are reversible, biological “tokenizers” are lossy but inject functional priors by providing higher-level abstractions.
  • Figure 2: Overview of the BiGCARP model and ESM-based initialization. Left: A BGC is represented as a sequence of Pfam domains, which are embedded through an embedding matrix initialized either randomly or with ESM. BiGCARP processes these embeddings into last-layer representations, which are passed to a prediction head for masked language modeling and downstream probe tasks. Right: In ESM-based initialization, each Pfam domain is associated with a representative amino acid sequence. The sequence is encoded by the pre-trained ESM model, mean-pooled, and used to construct the embedding vector for that domain in BiGCARP. Blue blocks illustrate an example of domain PF1 receiving its embedding initialization $\{\text{PF}_{i}\}_{i=1}^n$ from its representative amino acid sequence (right), and progressing from its embedder-layer embeddings $\{\text{PF}_{iE}\}_{i=1}^n$ to its last-layer embeddings $\{\text{PF}_{iL}\}_{i=1}^n$ (left).
  • Figure 3: Two-dimensional UMAP projection of BiGCARP last-layer embeddings. All Pfam domains in the training corpus are shown and colored by functional category. Functionally coherent clusters emerge, indicating that BiGCARP captures high-level functional information. Three representative clusters are selected and analyzed in the main text to illustrate their underlying themes.
  • Figure 4: CKA self-similarity of the ESM-initialized model across layers. The four-period pattern reflects the model’s four-block architecture. A sharp change between layers 0 and 1 arises from the dilated convolution layer, while the low similarity between the first and last layers indicates limited retention of initial representations.
  • Figure 5: Evolution of CKA similarity during training across layers and initialization schemes. The near-constant similarity of the ESM-initialized embedder to itself (blue) indicates that it is trainable but minimally reshaped during training. The red and purple curves show that, despite different initialization strategies, the similarity between initial and last layers converges to a similar level, suggesting that ESM initialization does not preserve additional information in the final layer. The low similarity of cross-model comparisons (green) provides a consistency check.
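
The mean-pooled initialization in Figure 2 and the CKA comparisons in Figures 4 and 5 can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the per-residue vectors are random stand-ins for ESM outputs, the sizes `n_domains` and `dim` are arbitrary, the helper names `mean_pool` and `linear_cka` are hypothetical, and the linear variant of CKA is used because the captions do not state which variant the authors applied.

```python
import numpy as np


def mean_pool(residue_embeddings: np.ndarray) -> np.ndarray:
    """Average per-residue vectors of shape (length, dim) into one (dim,) vector."""
    return residue_embeddings.mean(axis=0)


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representation matrices that share the same rows.

    x has shape (n, d1) and y has shape (n, d2); row i of each matrix must
    describe the same item (here, the same Pfam domain).
    """
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)
n_domains, dim = 50, 16  # illustrative sizes, not the paper's

# Random stand-ins for per-residue encoder outputs, one array per domain,
# each with a different sequence length.
per_residue = [
    rng.normal(size=(int(rng.integers(20, 60)), dim)) for _ in range(n_domains)
]

# One mean-pooled vector per domain forms the initial embedding matrix.
init_matrix = np.stack([mean_pool(e) for e in per_residue])

# A rotated and rescaled copy carries the same information in a new basis.
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
rotated = 3.0 * init_matrix @ q

# An independent random matrix shares no structure with the initialization.
unrelated = rng.normal(size=(n_domains, dim))

cka_self = linear_cka(init_matrix, init_matrix)
cka_rotated = linear_cka(init_matrix, rotated)
cka_unrelated = linear_cka(init_matrix, unrelated)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `cka_rotated` stays close to 1 even though every coordinate has changed, while `cka_unrelated` is clearly lower. This is the sense in which a low score between initial and last-layer embeddings suggests that the initial structure was not retained, rather than merely re-expressed in a different basis.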