Automated Classification of Dry Bean Varieties Using XGBoost and SVM Models

Ramtin Ardeshirifar

TL;DR

The study addresses automated classification of seven dry bean varieties from image-derived features. It compares XGBoost and SVM using PCA-based dimensionality reduction and a standardized preprocessing pipeline, validated with nested cross-validation. Both models achieve approximately 94% accuracy, with SVM (94.39%) slightly outperforming XGBoost (94.00%), which supports the viability of automated seed classification for improving seed uniformity and crop yield. The work contributes to precision agriculture by providing robust, repeatable seed-quality-control methods, and it suggests expanding the datasets and incorporating more advanced algorithms in future research.

Abstract

This paper presents a comparative study on the automated classification of seven different varieties of dry beans using machine learning models. Leveraging a dataset of 12,909 dry bean samples, reduced from an initial 13,611 through outlier removal and feature extraction, we applied Principal Component Analysis (PCA) for dimensionality reduction and trained two multiclass classifiers: XGBoost and Support Vector Machine (SVM). The models were evaluated using nested cross-validation to ensure robust performance assessment and hyperparameter tuning. The XGBoost and SVM models achieved overall correct classification rates of 94.00% and 94.39%, respectively. The results underscore the efficacy of these machine learning approaches in agricultural applications, particularly in enhancing the uniformity and efficiency of seed classification. This study contributes to the growing body of work on precision agriculture, demonstrating that automated systems can significantly support seed quality control and crop yield optimization. Future work will explore incorporating more diverse datasets and advanced algorithms to further improve classification accuracy.
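
The evaluation scheme described above (standardize, reduce with PCA, tune and score with nested cross-validation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic stand-ins generated with scikit-learn, the 16-feature count, the fold counts, and the hyperparameter grid are all hypothetical, and only the SVM branch is shown (an XGBoost classifier could be substituted as the final pipeline step).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the bean features (assumption: 16 features, 7 classes).
X, y = make_classification(
    n_samples=700, n_features=16, n_informative=10, n_redundant=4,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# Scaling and PCA live inside the pipeline so they are re-fit on each
# training fold and never see the held-out data.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),  # 10 components, as reported for Figure 3
    ("svm", SVC(kernel="rbf")),
])

# Hypothetical search grid; the source does not list the values it searched.
param_grid = {"svm__C": [1, 10], "svm__gamma": ["scale", 0.1]}

# Inner loop selects hyperparameters; outer loop estimates generalization.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="accuracy")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Because the grid search is itself the estimator passed to the outer loop, each outer test fold is scored by a model whose hyperparameters were chosen without access to that fold, which is what keeps the reported accuracy from being optimistically biased.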

Paper Structure (15 sections, 1 equation, 4 figures, 2 tables)

This paper contains 15 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Pie chart of the data distribution among classes. DERMASON is the largest class, accounting for 26.20% of the data, while BOMBAY is the smallest, representing only 3.83% of the data.
  • Figure 2: Each cell shows the correlation between two variables. A correlation coefficient measures the strength and direction of the linear relationship between two variables. The values range between -1.0 and 1.0. A correlation of -1.0 indicates a perfect negative correlation, while a correlation of 1.0 indicates a perfect positive correlation. A correlation of 0.0 indicates no linear relationship between the two variables.
  • Figure 3: This plot shows the variance explained by each principal component and the cumulative explained variance, starting with the first component. The x-axis represents the number of components, and the y-axis represents the cumulative explained variance in percent. A total of 10 principal components were identified that cumulatively explain 99.99% of the variance.
  • Figure 4: Illustration of the nested cross-validation process.
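
The component count reported for Figure 3 comes from a cumulative explained-variance curve. The sketch below shows one way such a curve and cutoff could be computed with NumPy alone. It is an illustration under assumptions rather than the paper's procedure: the function name `components_for_variance`, the 0.9999 threshold, and the synthetic data (16 features driven by 10 latent factors) are all hypothetical.

```python
import numpy as np

def components_for_variance(X, threshold=0.9999):
    """Return (k, cumulative): the smallest number of principal components
    whose cumulative explained-variance ratio reaches `threshold`, plus the
    full cumulative curve."""
    # Standardize each feature so that scale differences do not dominate.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Squared singular values are proportional to per-component variance.
    s = np.linalg.svd(Xs, compute_uv=False)
    explained = s**2 / np.sum(s**2)
    cumulative = np.cumsum(explained)
    # First index where the curve reaches the threshold, as a 1-based count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic stand-in: 16 observed features built from 10 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 16))
X = latent @ mixing + 1e-4 * rng.normal(size=(500, 16))

k, cumulative = components_for_variance(X)
print(f"components needed: {k}")
print("cumulative explained variance (%):", np.round(100 * cumulative, 2))
```

Plotting `cumulative` against the component index would give a curve of the kind the Figure 3 caption describes.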