When Handshakes Tell the Truth: Detecting Web Bad Bots via TLS Fingerprints
Ghalia Jarad, Kemal Bicakci
TL;DR
The study tackles the problem of distinguishing malicious web bots from human users by leveraging TLS handshake fingerprints, specifically JA4, as a protocol-level signal that is difficult for bots to spoof. It trains two gradient-boosted tree classifiers, XGBoost and CatBoost, on the JA4DB dataset to classify bot versus human traffic using JA4-derived features. CatBoost achieves the best results, with an AUC of about 0.998 and F1 ≈ 0.9734, while XGBoost performs comparably (AUC ≈ 0.998, F1 ≈ 0.9732). Feature importance analyses identify ja4_b, cipher_count, and ext_count as the strongest discriminators between bots and humans. The findings demonstrate that JA4-based TLS fingerprints provide a practically effective, privacy-preserving signal for bot detection that is robust to IP changes. Future work will extend the approach to newer protocols such as HTTP/3, incorporate additional fingerprint features, and add adversarial testing.
Abstract
Automated traffic continues to surpass human-generated traffic on the web, and a rising proportion of this automation is explicitly malicious. Evasive bots can pretend to be real users, solve CAPTCHAs, and mimic human interaction patterns. This work explores a less intrusive, protocol-level method: using TLS fingerprinting with the JA4 technique to distinguish bots from real users. Two gradient-boosted machine learning classifiers (XGBoost and CatBoost) were trained and evaluated on a dataset of real TLS fingerprints (JA4DB) after a feature-extraction step that derived informative signals from the JA4 fingerprints, which describe TLS handshake parameters. The CatBoost model performed better, achieving an AUC of 0.998, an F1 score of 0.9734, and an accuracy of 0.9863 on the test set. The XGBoost model showed nearly identical results. Feature importance analyses identified JA4 components, especially ja4_b, cipher_count, and ext_count, as the most influential on model performance. Future research will extend this method to newer protocols, such as HTTP/3, and add further device-fingerprinting features to test how well the system resists advanced bot evasion tactics.
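To make the feature-extraction step concrete, the following is a minimal sketch of how the named features could be read out of a JA4 string. It assumes the publicly documented JA4 layout `a_b_c`, where the 10-character `a` section encodes protocol, TLS version, SNI flag, cipher count, extension count, and ALPN, and the `b` and `c` sections are truncated hashes. The function `parse_ja4`, the class `JA4Features`, and every field name other than `ja4_b`, `cipher_count`, and `ext_count` are hypothetical; the authors' actual extraction code is not shown in this summary and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JA4Features:
    """Illustrative container; field names are assumptions, not the paper's schema."""
    protocol: str      # "t" for TLS over TCP, "q" for QUIC
    tls_version: str   # e.g. "13" for TLS 1.3
    sni: str           # "d" if SNI names a domain, "i" otherwise
    cipher_count: int  # number of cipher suites offered
    ext_count: int     # number of extensions offered
    alpn: str          # two-character ALPN indicator, e.g. "h2"
    ja4_b: str         # truncated hash of the sorted cipher suites
    ja4_c: str         # truncated hash of extensions and signature algorithms


def parse_ja4(fingerprint: str) -> JA4Features:
    """Split a JA4 string into its sections and decode the fixed-width 'a' part."""
    parts = fingerprint.strip().split("_")
    if len(parts) != 3 or len(parts[0]) != 10:
        raise ValueError(f"not a well-formed JA4 fingerprint: {fingerprint!r}")
    a, b, c = parts
    return JA4Features(
        protocol=a[0],
        tls_version=a[1:3],
        sni=a[3],
        cipher_count=int(a[4:6]),
        ext_count=int(a[6:8]),
        alpn=a[8:10],
        ja4_b=b,
        ja4_c=c,
    )


features = parse_ja4("t13d1516h2_8daaf6152771_b186095e22b6")
print(features.cipher_count, features.ext_count, features.ja4_b)
```

In a pipeline like the one described, the numeric fields could be passed to a tree-based classifier directly, while hash-valued fields such as `ja4_b` would be treated as categorical; that encoding choice is an assumption here rather than something the abstract states.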
