Selection of a Minimal Number of Significant Porcine SNPs by an Information Gain and Genetic Algorithm Hybrid Model

Wanthanee Rathasamuth, Kitsuchart Pasupa, Sissades Tongsima

TL;DR

This work identifies a minimal yet informative SNP subset for swine breed identification by combining information gain (IG) as a filter with a genetic algorithm (GA) wrapper, followed by a frequency-based feature selection (FFS) step, with support vector machines as the classifier. The proposed IG+Proposed GA+FFS approach reduces 16,579 SNPs to 142 markers (0.86%) while achieving a peak classification accuracy of 94.80%, which exceeds the accuracy obtained with all SNPs. The method is validated on a diverse porcine dataset and supported by ANOVA and PCA analyses, which confirm statistical significance and show that breed structure is preserved with the reduced panel. The approach offers a scalable, high-accuracy solution for genomic breed classification and opens the way to biological interpretation of the selected SNPs through subsequent pathway and GO analyses.

Abstract

A panel of a large number of common Single Nucleotide Polymorphisms (SNPs) distributed across the entire porcine genome has been widely used to represent the genetic variability of pigs. With the advent of SNP-array technology, a genome-wide genetic profile of a specimen can be easily observed. Among the large number of such variations, there exists a much smaller subset of the SNP panel that could equally be used to correctly identify the corresponding breed. This work presents an SNP selection heuristic that yields a reduced panel that can still be used effectively in the breed classification process. The proposed feature selection combines a filter method (information gain) and a wrapper method (a genetic algorithm), plus a feature frequency selection step, while classification is done by a support vector machine. The approach was able to reduce the number of significant SNPs to 0.86% of the total number of SNPs in a swine dataset and provided a high classification accuracy of 94.80%.
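
To make the filter stage concrete, the following is a minimal sketch of ranking SNPs by information gain with respect to the breed label. It is an illustration under stated assumptions, not the authors' implementation: the 0/1/2 genotype coding, the toy data, the placeholder breed names, and the helper names `entropy`, `information_gain`, and `rank_snps` are all hypothetical. The genetic algorithm wrapper, the feature frequency selection step, and the SVM classifier are not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """H(labels) minus the weighted entropy of labels within each feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_snps(genotypes, labels):
    """Return SNP column indices sorted by decreasing information gain, plus the gains."""
    n_snps = len(genotypes[0])
    gains = [
        information_gain([row[j] for row in genotypes], labels)
        for j in range(n_snps)
    ]
    order = sorted(range(n_snps), key=lambda j: gains[j], reverse=True)
    return order, gains


# Toy data (assumed coding: 0/1/2 = copies of the minor allele; rows are animals).
genotypes = [
    [0, 1, 2],
    [0, 2, 2],
    [2, 1, 2],
    [2, 2, 2],
]
labels = ["breed_A", "breed_A", "breed_B", "breed_B"]  # placeholder breed names

order, gains = rank_snps(genotypes, labels)
print(order, gains)
```

In this toy example only the first SNP separates the two placeholder breeds, so it is ranked first; a filter of this kind would pass the top-ranked SNPs on to the wrapper stage.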

Paper Structure

This paper contains 13 sections, 5 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Procedural steps of GA in operation with SVM.
  • Figure 2: An example of binary bit strings of genes that make up two chromosomes.
  • Figure 3: An experimental framework of feature selection for classification.
  • Figure 4: Application of FFS to combining and selecting features from linear and RBF kernels.
  • Figure 5: Classification accuracies and numbers of selected SNPs resulted from using a range of $P_m$ values.
  • ...and 4 more figures