An Embarrassingly Simple Approach to Enhance Transformer Performance in Genomic Selection for Crop Breeding

Renqi Chen; Wenwei Han; Haohao Zhang; Haoyang Su; Zhefan Wang; Xiaolei Liu; Hao Jiang; Wanli Ouyang; Nanqing Dong

TL;DR

A simple yet effective Transformer-based framework that enables end-to-end training on the whole sequence is proposed to unleash the unexplored potential of the attention mechanism for genomic selection (GS). It achieves overall superior performance against seminal methods on the GS tasks of interest.

Abstract

Genomic selection (GS), as a critical crop breeding strategy, plays a key role in enhancing food production and addressing the global hunger crisis. The predominant approaches in GS currently rely on statistical methods for prediction. However, statistical methods often come with two main limitations: strong statistical priors and linear assumptions. A recent trend is to capture the non-linear relationships between markers with deep learning. However, as crop datasets commonly consist of long sequences with limited samples, the robustness of deep learning models, especially Transformers, remains a challenge. In this work, to unleash the unexplored potential of the attention mechanism for the task of interest, we propose a simple yet effective Transformer-based framework that enables end-to-end training on the whole sequence. Via experiments on the rice3k and wheat3k datasets, we show that, with simple tricks such as k-mer tokenization and random masking, a Transformer can achieve overall superior performance against seminal methods on the GS tasks of interest.

Paper Structure (26 sections, 11 equations, 2 figures, 10 tables)


Introduction
Related Work
Analysis-based Genomic Selection
Deep Learning-based Genomic Selection
Sequence Representation
Problem Formulation
Methodology
Pre-processed SNP Sequence
Sequence Tokenizer
Phenotype Learning
Experiments
Setup
Datasets
Implementation
Evaluation Metrics
...and 11 more sections

Figures (2)

Figure 1: Illustration of the proposed framework, which consists of two modules: (a) a pre-processing module and (b) a learning module. (a) The raw SNP sequence is first pre-processed by a many-to-one mapping rule. Let $Z$ denote an arbitrary letter that does not belong to the set {A, T, C, G}, the four most frequent letters in SNP sequences. $Z$ is mapped to $X$ to reduce computational cost without influencing performance. The pre-processed sequence is then passed to the tokenizer, which applies k-mer tokenization and random masking to produce a token ID sequence, and then to the embedding layer. (b) A Transformer encoder and MLP layers are applied to the embedding vectors to predict the phenotype.
Figure 2: Sensitivity analysis on $k$ for k-mer tokenization on the wheat3k dataset. The optimal value of $k$ might be dependent on the task (phenotype).
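The pre-processing and tokenization steps described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the non-overlapping k-mer split, the `[MASK]` token, the token-level masking, and the masking probability are all assumptions made for illustration, since the captions do not specify them.

```python
import random

# The four most frequent letters in an SNP sequence, per the Figure 1 caption.
CANONICAL = {"A", "T", "C", "G"}


def preprocess(seq, placeholder="X"):
    """Many-to-one mapping: any letter outside {A, T, C, G} becomes `placeholder`."""
    return "".join(ch if ch in CANONICAL else placeholder for ch in seq)


def kmer_tokens(seq, k):
    """Split the sequence into k-mers.

    Assumption: non-overlapping chunks; the last chunk may be shorter than k.
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def random_mask(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Replace each token with `mask_token` independently with probability `mask_prob`.

    Assumption: masking is applied per token; the mask token name is hypothetical.
    """
    return [mask_token if rng.random() < mask_prob else t for t in tokens]


def to_ids(tokens, vocab):
    """Map tokens to integer IDs, growing `vocab` in place for unseen tokens."""
    ids = []
    for t in tokens:
        if t not in vocab:
            vocab[t] = len(vocab)
        ids.append(vocab[t])
    return ids


raw = "ATCGNRATTGY"                    # toy sequence, not real data
clean = preprocess(raw)                # "ATCGXXATTGX"
tokens = kmer_tokens(clean, k=3)       # ["ATC", "GXX", "ATT", "GX"]
masked = random_mask(tokens, mask_prob=0.15, rng=random.Random(0))
vocab = {"[MASK]": 0}
token_ids = to_ids(masked, vocab)      # integer IDs for an embedding layer
```

With non-overlapping k-mers, a sequence of length n shrinks to roughly n / k tokens, which is one plausible reason k-mer tokenization helps a Transformer process a whole long sequence end to end. As the Figure 2 caption notes, the best value of $k$ may depend on the phenotype.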
