BOTracle: A framework for Discriminating Bots and Humans

Jan Kadel, August See, Ritwik Sinha, Mathias Fischer

TL;DR

This work tackles bot versus human discrimination in high-traffic web environments by integrating a three-tier detection pipeline: fast heuristics, a Semi-Supervised Generative Adversarial Network (SGAN) leveraging labeled and unlabeled data, and a Deep Graph Convolutional Neural Network (DGCNN) operating on Website Traversal (WT) graphs to capture session-wide behavior. The pipeline uses a confidence threshold $\lambda$ to gate predictions and progressively enriches uncertain hits with WT-graph analysis, aiming to minimize user disruption while maintaining high accuracy. Evaluations on a real-world e-commerce dataset (~40 million monthly visits) show that the proposed approach achieves precision, recall, and AUROC scores at or near 0.98 or higher, often surpassing Botcha, with WT-graph–based behavioral features providing a robust advantage. Limitations include lack of ground-truth sharing and the inherent challenges of bots that precisely mimic human behavior, guiding future work toward reducing labeling dependence and further reducing required user interaction for detection.
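The threshold-gated cascade described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `classify`, the `Verdict` type, the callable interfaces, and the default value of `lam` are all hypothetical, and the SGAN and DGCNN are replaced by plain callables that return a bot probability. How the final stage treats low-confidence outputs is also an assumption here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

BOT, HUMAN = "bot", "human"


@dataclass
class Verdict:
    label: str         # "bot" or "human"
    stage: str         # which stage produced the decision
    confidence: float  # confidence in the returned label


def classify(
    hit: Any,
    heuristic: Callable[[Any], Optional[str]],  # returns a label or None (hypothetical interface)
    hit_model: Callable[[Any], float],          # stand-in for the SGAN: P(bot) for a single hit
    session_model: Callable[[Any], float],      # stand-in for the DGCNN on the WT graph: P(bot)
    lam: float = 0.9,                           # confidence threshold; the value is an assumption
) -> Verdict:
    # Stage 1: cheap heuristics decide the obvious cases immediately.
    label = heuristic(hit)
    if label is not None:
        return Verdict(label, "heuristic", 1.0)

    # Stage 2: accept the per-hit prediction only if it is confident enough.
    p_bot = hit_model(hit)
    confidence = max(p_bot, 1.0 - p_bot)
    if confidence >= lam:
        return Verdict(BOT if p_bot >= 0.5 else HUMAN, "hit", confidence)

    # Stage 3: uncertain hits fall through to the session-level behavioral model.
    # Assumption: this sketch returns its decision regardless of confidence.
    p_bot = session_model(hit)
    confidence = max(p_bot, 1.0 - p_bot)
    return Verdict(BOT if p_bot >= 0.5 else HUMAN, "session", confidence)
```

In this sketch, raising `lam` sends more hits to the session-level stage, trading latency for the additional behavioral evidence.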

Abstract

Bots constitute a significant portion of Internet traffic and are a source of various issues across multiple domains. Modern bots are often indistinguishable from real users, as they employ similar methods to browse the web, including using real browsers. We address the challenge of bot detection in high-traffic scenarios by analyzing three distinct detection methods. The first method operates on heuristics, allowing for rapid detection. The second method uses well-known technical features, such as IP address, window size, and user agent, and serves primarily as a point of comparison for the third method. In the third method, we rely solely on browsing behavior, omitting all static features and focusing exclusively on how clients behave on a website. In contrast to related work, we evaluate our approaches using real-world e-commerce traffic data comprising 40 million monthly page visits. We further compare our methods against another bot detection approach, Botcha, on the same dataset. Our performance metrics, including precision, recall, and AUC, reach 98 percent or higher, surpassing Botcha.

Paper Structure

This paper contains 20 sections, 3 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Multi-Stage Bot Detection Pipeline Process as Flow Chart