Genomic Next-Token Predictors are In-Context Learners

Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi

TL;DR

The paper asks whether in-context learning (ICL) can emerge in non-linguistic domains by testing a genomic next-nucleotide predictor (Evo2) against a linguistic baseline (Qwen3) in a cross-domain bitstring induction framework. It introduces a controlled setup in which the same symbolic transformations are rendered in both genomic (A/C/G/T) and linguistic forms, and it measures exact-match accuracy as the number of demonstrations grows, formalizing the metrics with Monte Carlo sampling and comparing against a mode baseline. Both Evo2 and Qwen3 show log-linear accuracy gains as demonstrations increase, and Evo2's ICL performance is strong, often exceeding that of language models of comparable scale. These findings support a modality-agnostic view of ICL as arising from large-scale predictive compression over rich sequence data, and they motivate broader cross-domain investigation into the mechanisms and scope of emergent meta-learning.
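To make the two evaluation terms above concrete, here is a minimal sketch of exact-match accuracy and one plausible reading of a mode baseline. The summary does not spell out how the paper defines its mode baseline, so `mode_prediction` (always answering with the most frequent demonstration output) is an assumption, and the function names and toy strings are hypothetical.

```python
from collections import Counter


def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal their target string exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def mode_prediction(demo_outputs):
    """Assumed mode baseline: return the most common demonstration output."""
    return Counter(demo_outputs).most_common(1)[0][0]


# Toy strings for illustration only; these are not data from the paper.
demo_outputs = ["ACA", "ACA", "CAC"]
targets = ["ACA", "CAC", "ACA", "AAC"]

baseline_preds = [mode_prediction(demo_outputs)] * len(targets)
baseline_acc = exact_match_accuracy(baseline_preds, targets)
```

Under this reading, a model shows in-context learning only to the extent that its exact-match accuracy exceeds what this input-independent guess achieves.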

Abstract

In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.
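A "log-linear gain" means accuracy grows roughly linearly in the logarithm of the number of demonstrations. The sketch below fits such a trend by ordinary least squares. The shot counts (powers of two up to 128) and the accuracy values are synthetic placeholders chosen for illustration, not results from the paper, and `fit_log_linear` is a hypothetical helper name.

```python
import math


def fit_log_linear(shots, accuracies):
    """Least-squares fit of accuracy ~ intercept + slope * log2(shots)."""
    xs = [math.log2(k) for k in shots]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumed shot grid; the values below are SYNTHETIC, not the paper's numbers.
shots = [1, 2, 4, 8, 16, 32, 64, 128]
synthetic_acc = [0.10 + 0.05 * math.log2(k) for k in shots]

slope, intercept = fit_log_linear(shots, synthetic_acc)
```

Here `slope` is the accuracy gained per doubling of the number of demonstrations, which gives a single number for comparing how quickly different models benefit from more context.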

Paper Structure

This paper contains 24 sections, 7 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: We design parallel symbolic reasoning tasks that allow direct comparison of ICL behavior across modalities (linguistic and genomic). Few-shot bitstring program-synthesis tasks (e.g., identity, NOT, majority, reverse) require models to infer mappings from examples. Each task is rendered in two modality-specific encodings: genomic (bitstrings mapped to random nucleotides A/T/C/G) and linguistic (bitstrings mapped to random digits), preserving abstract structure but differing in surface form. Both genomic (Evo2) and linguistic (Qwen3) models receive $k$-shot demonstrations and are greedily decoded to compute exact-match accuracy. Both models show log-linear accuracy gains with more demonstrations.
  • Figure 2: Few-shot performance of Qwen3 and Evo2 models. (a) Evo2 model performance with respect to log(shots). All models improve monotonically; the 7B and 40B models perform roughly equally, and the 1B model trails behind them. (b) Qwen3 model performance with respect to log(shots). All models improve, but not always monotonically. Smaller models struggle in the 4-16 shot range. (c) At comparable sizes, Evo2 outperforms Qwen3. (d) Averaged performance across both model families shows consistent improvement with respect to log(shots). All models exceed the mode baseline, shown in gray.
  • Figure 3: Performance at $n=128$ shots. All model accuracies increase monotonically with respect to parameter count.
  • Figure 4: Accuracy vs. BitLoad averaged across all tasks (BitLoad; Eq. \ref{eq:bitload}). Qwen declines sharply with increasing BitLoad, while Evo degrades more gradually, indicating greater robustness. Details in §\ref{sec:qual_anal:bit:load}.
  • Figure 5: Few-shot behavior and scaling trends across Qwen3 and Evo2.
  • ...and 6 more figures
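The task setup described in the Figure 1 caption can be sketched as follows: a bitstring transformation is applied, then both input and output are rendered with a random two-symbol mapping drawn from either the nucleotide alphabet or the digits. This is a rough illustration under stated assumptions, not the authors' code. The function names, the single-bit output for majority with ties resolved to 0, the choice of one mapping per prompt, and the bit length are all hypothetical.

```python
import random


def identity(bits):
    return bits


def bitwise_not(bits):
    return "".join("1" if b == "0" else "0" for b in bits)


def reverse(bits):
    return bits[::-1]


def majority(bits):
    # Assumption: output one bit, with ties resolved to "0".
    return "1" if bits.count("1") * 2 > len(bits) else "0"


TASKS = {"identity": identity, "not": bitwise_not,
         "reverse": reverse, "majority": majority}


def random_symbol_map(alphabet, rng):
    """Pick two distinct symbols from the alphabet to stand for bits 0 and 1."""
    zero, one = rng.sample(alphabet, 2)
    return {"0": zero, "1": one}


def render(bits, symbol_map):
    return "".join(symbol_map[b] for b in bits)


def make_demos(task, k, n_bits, alphabet, seed=0):
    """Build k (input, output) demonstrations that share one symbol mapping."""
    rng = random.Random(seed)
    symbol_map = random_symbol_map(alphabet, rng)
    fn = TASKS[task]
    demos = []
    for _ in range(k):
        bits = "".join(rng.choice("01") for _ in range(n_bits))
        demos.append((render(bits, symbol_map), render(fn(bits), symbol_map)))
    return demos


genomic = make_demos("reverse", k=4, n_bits=8, alphabet="ACGT")
linguistic = make_demos("reverse", k=4, n_bits=8, alphabet="0123456789")
```

Because the two encodings differ only in `alphabet`, the abstract structure of each task is identical across modalities. How the demonstrations are joined into a prompt string is not specified in this summary, so that step is left out.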