Predicting BVD Re-emergence in Irish Cattle From Highly Imbalanced Herd-Level Data Using Machine Learning Algorithms

Niamh Mimnagh, Andrew Parnell, Conor McAloon, Jaden Carlson, Maria Guelbenzu, Jonas Brock, Damien Barrett, Guy McGrath, Jamie Tratalos, Rafael Moral

TL;DR

This study tackles the risk of BVD re-emergence in Ireland after substantial eradication by evaluating a broad set of machine learning approaches on highly imbalanced herd-level data. It compares binary classification methods (GLMs, regularised regression, tree-based models, SVM) and anomaly detectors (LOF, ABOF, Mahalanobis, MCD, Isolation Forest, Autoencoders) under varying sample sizes and class imbalance ratios, incorporating resampling and class weighting. Across simulations and real data (2013–2023), Random Forest (RF) and XGBoost emerge as the top performers. RF achieves the highest sensitivity and AUC and correctly identifies 219 of 250 positive herds in 2023, while roughly halving the number of herds that need testing relative to a blanket-testing strategy. The findings support targeted surveillance strategies that balance detection of re-emergence against practical testing burdens, and they illustrate the value and limitations of imbalanced-data ML approaches for livestock disease monitoring.

Abstract

Bovine Viral Diarrhoea (BVD) has been the focus of a successful eradication programme in Ireland, with the herd-level prevalence declining from 11.3% in 2013 to just 0.2% in 2023. As the country moves toward BVD freedom, the development of predictive models for targeted surveillance becomes increasingly important to mitigate the risk of disease re-emergence. In this study, we evaluate the performance of a range of machine learning algorithms, including binary classification and anomaly detection techniques, for predicting BVD-positive herds using highly imbalanced herd-level data. We conduct an extensive simulation study to assess model performance across varying sample sizes and class imbalance ratios, incorporating resampling, class weighting, and appropriate evaluation metrics (sensitivity, positive predictive value, F1-score and AUC values). Random forests and XGBoost models consistently outperformed other methods, with the random forest model achieving the highest sensitivity and AUC across scenarios, including real-world prediction of 2023 herd status, correctly identifying 219 of 250 positive herds while halving the number of herds that require testing compared to a blanket-testing strategy.
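
As a rough illustration of the evaluation metrics named in the abstract, the sketch below computes sensitivity, positive predictive value and F1-score from confusion-matrix counts, using their standard textbook definitions. Only the 219-of-250 figure comes from the abstract; the function names are illustrative, and the false-positive count used for PPV and F1 is a hypothetical placeholder, since this section does not report it.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of truly positive herds that the model flags (recall)."""
    return tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """Share of flagged herds that are truly positive (precision)."""
    return tp / (tp + fp)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of sensitivity and PPV."""
    return 2 * tp / (2 * tp + fp + fn)


# Reported in the abstract: 219 of 250 positive herds identified in 2023.
tp = 219
fn = 250 - tp  # 31 positive herds missed

sens_2023 = sensitivity(tp, fn)
print(f"sensitivity = {sens_2023:.3f}")  # sensitivity = 0.876

# HYPOTHETICAL false-positive count, for illustration only; the true
# value is not given in this section.
fp_hypothetical = 1000
print(f"PPV (hypothetical fp) = {ppv(tp, fp_hypothetical):.3f}")
print(f"F1  (hypothetical fp) = {f1_score(tp, fp_hypothetical, fn):.3f}")
```

With very few positive herds, even a modest false-positive rate among the many negative herds can pull PPV down sharply, which is one reason sensitivity and PPV are reported alongside AUC rather than overall accuracy.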

Paper Structure

This paper contains 22 sections, 19 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: A random forest composed of three trees. Each tree produces a class prediction, and the overall prediction is then determined as the majority of the individual tree predictions.
  • Figure 2: A hyperplane separating two-dimensional data into two classes, where the centre solid line represents the hyperplane, the two dashed lines represent the soft margin, and the shaded observations within the margin are the support vectors.
  • Figure 3: Representation of how a support vector machine may map data that are not linearly separable to a higher dimension to obtain separability. (a) In the original feature space, the two-class data are not linearly separable. (b) The data are transformed into a higher-dimensional space ($X_{1}^{2}$) to make linear separation of the classes possible.
  • Figure 4: Representation of the k-neighbourhood of observation A (highlighted in pink), where the shaded observations represent the three nearest neighbours, and the dashed circle represents the k-neighbourhood $N_{3}(A)$.
  • Figure 5: The reachability distance (RD) of observation A (highlighted in pink) to each of its three nearest neighbours (highlighted in blue). The nearest neighbours of observations $B_{1}$, $B_{2}$, and $B_{3}$ are shaded, while the k-neighbourhood of each of $B_{1}$, $B_{2}$, and $B_{3}$ is represented by a dashed circle.
  • ...and 4 more figures