An Embarrassingly Simple Approach to Enhance Transformer Performance in Genomic Selection for Crop Breeding

Renqi Chen, Wenwei Han, Haohao Zhang, Haoyang Su, Zhefan Wang, Xiaolei Liu, Hao Jiang, Wanli Ouyang, Nanqing Dong

TL;DR

To unleash the unexplored potential of the attention mechanism for genomic selection (GS), the authors propose a simple yet effective Transformer-based framework that enables end-to-end training on the whole sequence and achieves overall superior performance against seminal methods on GS tasks.

Abstract

Genomic selection (GS), as a critical crop breeding strategy, plays a key role in enhancing food production and addressing the global hunger crisis. The predominant approaches in GS currently rely on statistical methods for prediction. However, statistical methods often come with two main limitations: strong statistical priors and linear assumptions. A recent trend is to capture the non-linear relationships between markers with deep learning. However, as crop datasets commonly consist of long sequences with limited samples, the robustness of deep learning models, especially Transformers, remains a challenge. In this work, to unleash the unexplored potential of the attention mechanism for GS, we propose a simple yet effective Transformer-based framework that enables end-to-end training on the whole sequence. Via experiments on the rice3k and wheat3k datasets, we show that, with simple tricks such as k-mer tokenization and random masking, a Transformer can achieve overall superior performance against seminal methods on GS tasks.

Paper Structure (26 sections, 11 equations, 2 figures, 10 tables)

Figures (2)

  • Figure 1: Illustration of the proposed framework, which consists of two modules: (a) a pre-processing module and (b) a learning module. (a) The raw SNP sequence is first pre-processed by a many-to-one mapping rule. Let $Z$ denote an arbitrary letter that does not belong to the set {A, T, C, G}, the four most frequent letters in SNP sequences. Every such $Z$ is mapped to $X$, which reduces computational cost without affecting performance. The pre-processed sequence is then passed to the tokenizer, which applies k-mer tokenization and random masking to produce a token ID sequence, and then to the embedding layer. (b) A Transformer encoder and MLP layers operate on the embedding vectors to predict the phenotype.
  • Figure 2: Sensitivity analysis on $k$ for k-mer tokenization on the wheat3k dataset. The optimal value of $k$ may depend on the task (phenotype).
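
The pre-processing module described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. Several details are assumptions, because the summary does not specify them: the k-mers are taken as non-overlapping (stride $k$), masking is applied to whole k-mer tokens, the mask symbol `[MASK]` is hypothetical, and the vocabulary is built on the fly. All function names are made up for this example.

```python
import random
from typing import Dict, List

BASES = set("ATCG")
UNKNOWN = "X"          # target of the many-to-one mapping (the caption's Z -> X)
MASK_TOKEN = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def map_to_alphabet(seq: str) -> str:
    """Replace every letter outside {A, T, C, G} with X."""
    return "".join(ch if ch in BASES else UNKNOWN for ch in seq.upper())


def kmer_tokens(seq: str, k: int) -> List[str]:
    """Split into non-overlapping k-mers (assumed stride k); the last may be shorter."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def random_mask(tokens: List[str], mask_prob: float, rng: random.Random) -> List[str]:
    """Independently replace each token with MASK_TOKEN with probability mask_prob."""
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to integer IDs, assigning a new ID to each unseen token."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    vocab: Dict[str, int] = {}
    clean = map_to_alphabet("ATNCGR-a")   # non-ATCG letters become X
    tokens = kmer_tokens(clean, k=3)      # sequence length shrinks by about a factor of k
    ids = encode(random_mask(tokens, mask_prob=0.15, rng=rng), vocab)
    print(clean, tokens, ids)
```

Under these assumptions, grouping $k$ letters into one token shortens the sequence the Transformer encoder sees by roughly a factor of $k$, which is one plausible reason $k$ matters in the sensitivity analysis of Figure 2. The resulting ID sequence would then go to an embedding layer, which is not shown here.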