Using Multivariate Linear Regression for Biochemical Oxygen Demand Prediction in Waste Water

Isaiah K. Mutai, Kristof Van Laerhoven, Nancy W. Karuri, Robert K. Tewo

Abstract

Multivariate Linear Regression (MLR) offers a way to predict Biochemical Oxygen Demand (BOD) in waste water from other water quality parameters. The goal of this work is to examine how well MLR predicts BOD in waste water from four input variables: Dissolved Oxygen (DO), Nitrogen, Fecal Coliform and Total Coliform. Of the seven parameters examined, these four showed the strongest correlation with BOD. The model was trained on 80% and on 90% of the data, with the remaining 20% and 10% respectively held out as the test set. MLR performance was evaluated through the coefficient of correlation (r), the Root Mean Square Error (RMSE) and the percentage accuracy in prediction of BOD. With these four input variables, the performance indices were RMSE = 6.77 mg/L, r = 0.60 and accuracy of 70.3% for a training set of 80% of the dataset, and RMSE = 6.74 mg/L, r = 0.60 and accuracy of 87.5% for a training set of 90% of the dataset. Increasing the training set above 80% of the dataset therefore improved only the accuracy of the model and did not have a significant impact on its prediction capacity, since RMSE and r were essentially unchanged. The results show that an MLR model can be successfully employed to estimate BOD in waste water when the input parameters are appropriately selected.
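The evaluation procedure described in the abstract (fit an MLR model on 80% or 90% of the data, then compute RMSE and r on the held-out remainder) can be sketched as follows. This is a minimal illustration, not the authors' code: the data are synthetic placeholders standing in for the four input variables and BOD, the function names (`fit_mlr`, `evaluate_split`) are hypothetical, and a random shuffle is assumed for the split. The percentage accuracy metric is omitted because the abstract does not define it.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares fit; returns [intercept, coef_1, ..., coef_k]."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_mlr(beta, X):
    """Apply fitted intercept and coefficients to new inputs."""
    return beta[0] + X @ beta[1:]

def rmse(y_true, y_pred):
    """Root Mean Square Error, in the units of y."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pearson_r(y_true, y_pred):
    """Coefficient of correlation between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def evaluate_split(X, y, train_fraction, seed=0):
    """Shuffle, split into train/test, fit on train, score on test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train, test = idx[:n_train], idx[n_train:]
    beta = fit_mlr(X[train], y[train])
    y_pred = predict_mlr(beta, X[test])
    return {
        "n_train": n_train,
        "n_test": len(test),
        "rmse": rmse(y[test], y_pred),
        "r": pearson_r(y[test], y_pred),
    }

# Synthetic placeholder data (NOT the paper's dataset): four columns stand in
# for DO, Nitrogen, Fecal Coliform and Total Coliform; y stands in for BOD.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
assumed_coefs = np.array([1.5, -2.0, 0.5, 1.0])  # arbitrary illustration values
y = 10.0 + X @ assumed_coefs + rng.normal(scale=1.0, size=200)

# Compare the two training fractions used in the abstract.
results = {frac: evaluate_split(X, y, frac) for frac in (0.8, 0.9)}
for frac, res in results.items():
    print(frac, res)
```

Scoring both splits with the same metrics makes it possible to check whether a larger training set changes RMSE and r, which is the comparison the abstract reports.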

Paper Structure (9 sections, 3 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: A scree plot showing the explained variance with respect to the principal components. All the principal components contributed almost equally to the total variation accounted for in the data
  • Figure 2: Principal Component Analysis (PCA). The dataset falls into three major clusters, with a few data points dispersed outside the clusters in the lower-dimensional space
  • Figure 3: Biplot of the PCA on the first and second principal components. Dissolved Oxygen, Fecal Coliforms, Total Coliforms and Nitrogen are strongly correlated with BOD. pH and Fecal Coliforms show no linear correlation with each other
  • Figure 4: K-Means clustering of the data. The elbow rule shows the existence of 3 unique clusters in the data structure
  • Figure 5: t-Distributed Stochastic Neighbour Embedding. From (a) it is clear that the data fall into three major clusters based on the BOD values: 0-0.2, 0.4-0.6 and 0.8-1.0. These can be labelled Low, Medium and High BOD levels, and the data are categorized accordingly in (b).
  • ...and 2 more figures
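
The explained variance shown in a scree plot such as Figure 1 can be computed from the singular values of the standardized data matrix. The sketch below is an illustration under stated assumptions, not the authors' procedure: the data are random placeholders with seven columns, the function name `explained_variance_ratio` is hypothetical, and standardizing each column before PCA is an assumption.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    # Standardize columns so each parameter contributes on the same scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Singular values of Z are returned in descending order.
    s = np.linalg.svd(Z, compute_uv=False)
    component_variance = s ** 2 / (len(Z) - 1)
    return component_variance / component_variance.sum()

# Random placeholder data (NOT the paper's dataset): 150 samples, 7 parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 7))

ratios = explained_variance_ratio(X)
print(np.round(ratios, 3))
```

When the columns are close to uncorrelated, as in this placeholder data, the ratios come out nearly equal, which is the kind of flat scree profile the Figure 1 caption describes.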