Selection of a Minimal Number of Significant Porcine SNPs by an Information Gain and Genetic Algorithm Hybrid Model
Wanthanee Rathasamuth, Kitsuchart Pasupa, Sissades Tongsima
TL;DR
This work tackles the challenge of identifying a minimal yet informative SNP subset for swine breed identification. It combines information gain as a filter with a genetic algorithm as a wrapper, adds a frequency-based feature selection step, and evaluates the selected subsets with support vector machines. The proposed IG+Proposed GA+FFS approach reduces 16,579 SNPs to 142 markers (0.86%) while achieving a peak classification accuracy of $94.80\%$, which is higher than the accuracy obtained with all SNPs. The method is validated on a diverse porcine dataset and supported by ANOVA and PCA analyses, which confirm that the result is statistically significant and that the reduced panel preserves breed structure. The approach offers a scalable, high-accuracy solution for genomic breed classification and opens the way to biological interpretation of the selected SNPs through subsequent pathway and GO analyses.
Abstract
A panel of a large number of common single nucleotide polymorphisms (SNPs) distributed across the entire porcine genome has been widely used to represent the genetic variability of pigs. With the advent of SNP-array technology, a genome-wide genetic profile of a specimen can be easily obtained. Within this large set of variations, there exists a much smaller subset of the SNP panel that could equally be used to correctly identify the corresponding breed. This work presents a SNP selection heuristic that finds such a reduced subset, one that can still be used effectively in the breed classification process. Feature selection combined a filter method (information gain) with a wrapper method (a genetic algorithm), followed by a feature frequency selection step; classification was done by a support vector machine. The approach reduced the number of significant SNPs to 0.86% of the total number of SNPs in a swine dataset while providing a high classification accuracy of 94.80%.
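The information gain filter stage can be illustrated with a minimal sketch. It assumes genotypes are coded as discrete values (for example 0, 1, 2) and scores each SNP by how much knowing its genotype reduces the entropy of the breed label. The function names, the toy data, and the top-k cutoff are hypothetical and are not taken from the paper; the genetic algorithm, feature frequency selection, and support vector machine stages are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(genotypes, breeds):
    """IG = H(breed) - H(breed | genotype) for a single SNP."""
    n = len(breeds)
    # Group breed labels by the genotype value observed at this SNP.
    groups = {}
    for g, b in zip(genotypes, breeds):
        groups.setdefault(g, []).append(b)
    # Weighted average entropy of the breed label within each genotype group.
    conditional = sum(len(members) / n * entropy(members)
                      for members in groups.values())
    return entropy(breeds) - conditional


def top_k_snps(snp_columns, breeds, k):
    """Return the names of the k SNPs with the highest information gain."""
    scores = {name: information_gain(col, breeds)
              for name, col in snp_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy example (hypothetical data): four animals from two breeds, two SNPs.
breeds = ["A", "A", "B", "B"]
snps = {
    "snp_informative": [0, 0, 2, 2],  # genotype separates the breeds perfectly
    "snp_noise": [0, 1, 0, 1],        # genotype is unrelated to breed
}
selected = top_k_snps(snps, breeds, k=1)
print(selected)
```

In a hybrid filter and wrapper scheme of this kind, the filter ranking is typically used to shrink the search space before the more expensive wrapper search evaluates candidate subsets with the classifier.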
