Efficient Detection of Botnet Traffic by features selection and Decision Trees

Javier Velasco-Mata, Víctor González-Castro, Eduardo Fidalgo, Enrique Alegre

TL;DR

This work addresses botnet traffic detection in large-scale network data by using feature selection to shrink the input feature set and speed up classification. It evaluates two ranking methods, Gini Importance and Information Gain, to derive three feature subsets (5, 6, and 7 features) and tests three classifiers (Decision Tree, Random Forest, and k-NN) on two CTU-13-based datasets, QB-CTU13 and EQB-CTU13. The results show that a single Decision Tree (DT) using five features achieves a strong speed-accuracy balance (a mean macro F1 score of about 85% at an average of 0.78 microseconds per sample) and classifies faster than the more complex ensembles. The study also introduces quasi-balanced and extended botnet datasets to better reflect real-world conditions and reports that k-NN and SVM baselines may lag in speed or robustness. Overall, the five-feature DT detector offers a practical, real-time-capable solution for botnet detection with competitive accuracy.
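The two ranking criteria named above can be illustrated with a minimal sketch. It scores each feature by how much a single split on that feature reduces label impurity, using entropy (Information Gain) and Gini impurity. This is a simplification: Gini Importance is usually accumulated over all splits of a trained tree or forest, and the paper's exact procedure is not shown in this summary. The toy flows, the bucketing of nPackets, and the `noise` feature are made-up assumptions for illustration; only the feature names dPort and nPackets come from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(values, labels, impurity):
    """Impurity of the labels minus the weighted impurity after grouping by feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    after = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - after


# Made-up toy flows: (dPort, bucketed nPackets, hypothetical noise feature).
flows = [
    (6667, "low", "a"), (6667, "low", "b"), (6667, "high", "a"),
    (80, "high", "b"), (80, "high", "a"), (443, "low", "b"),
    (443, "high", "a"), (80, "low", "b"),
]
labels = ["botnet", "botnet", "botnet", "normal",
          "normal", "normal", "normal", "botnet"]
names = ["dPort", "nPackets", "noise"]


def rank(impurity):
    """Return (feature, score) pairs sorted from most to least informative."""
    scores = {
        name: impurity_decrease([row[i] for row in flows], labels, impurity)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ig_ranking = rank(entropy)   # Information Gain ranking
gini_ranking = rank(gini)    # single-split Gini decrease ranking

for name, score in ig_ranking:
    print(f"IG   {name}: {score:.3f}")
for name, score in gini_ranking:
    print(f"Gini {name}: {score:.3f}")
```

On this toy data both criteria place dPort first and the noise feature last. Keeping only the top-ranked features is the kind of subset selection the summary describes.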

Abstract

Botnets are among the most widespread online threats, causing losses in the billions to global economies. The growing number of devices connected to the Internet makes it necessary to analyze large amounts of network traffic data. In this work, we focus on improving the performance of botnet traffic classification by selecting those features that further increase the detection rate. For this purpose we use two feature selection techniques, Information Gain and Gini Importance, which led to three pre-selected subsets of five, six and seven features. We then evaluate the three feature subsets with three models: Decision Tree, Random Forest and k-Nearest Neighbors. To test the performance of the three feature vectors and the three models, we generate two datasets based on the CTU-13 dataset, namely QB-CTU13 and EQB-CTU13. We measure the performance as the macro-averaged F1 score over the computational time required to classify a sample. The results show that the highest performance is achieved by Decision Trees using a five-feature set, which obtained a mean F1 score of 85% while classifying each sample in an average time of 0.78 microseconds.
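As a rough illustration of the two quantities behind this performance measure, the sketch below computes a macro-averaged F1 score and an average per-sample classification time. It reports them separately, because the exact way the paper combines them is not given in this summary. The `toy_predict` rule, the port values, and the labels are hypothetical stand-ins, not the paper's classifier or data.

```python
import time


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def seconds_per_sample(predict, samples, repeats=5):
    """Average wall-clock time per sample over `repeats` calls to predict(samples)."""
    start = time.perf_counter()
    for _ in range(repeats):
        predict(samples)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(samples))


def toy_predict(ports):
    """Hypothetical stand-in classifier: flag flows to port 6667 as botnet."""
    return ["botnet" if port == 6667 else "normal" for port in ports]


# Made-up destination ports and ground-truth labels.
ports = [6667, 6667, 80, 443, 80, 6667]
truth = ["botnet", "botnet", "normal", "normal", "botnet", "normal"]

pred = toy_predict(ports)
score = macro_f1(truth, pred)
latency = seconds_per_sample(toy_predict, ports)
print(f"macro F1 = {score:.3f}, time per sample = {latency * 1e6:.2f} microseconds")
```

Macro averaging weights the botnet and normal classes equally regardless of how many samples each has, which matters when the traffic classes are imbalanced.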

Paper Structure

This paper contains 15 sections, 5 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Visualization of the experimentation.
  • Figure 2: Two computers establish a communication using the TCP protocol. The set of exchanged packets is called a flow, and extracting its features yields one sample.
  • Figure 3: Gini Importance (GI) and Information Gain (IG) of the 11 selected features over the QB-CTU13 dataset.
  • Figure 4: Evolution of the F1 score as features are added to classify the traffic. The top panels (a and b) compare the F1 score of different versions of RF integrating $m$ DTs, and the bottom panels (c and d) compare different versions of k-NN by varying the number $k$ of neighbors. In the left panels (a and c), the features were added following the Gini Importance ranking, while in the right panels (b and d) they were added following the Information Gain ranking.
  • Figure 5: Evolution of the F1 scores as the number of considered features increases following the Gini Importance ranking. The $x$ axis shows the feature that is added: the models first use only dPort, then dPort and nPackets, and so on. For RF and k-NN, only the optimal values of $m$ and $k$ are shown.
  • ...and 4 more figures