Unveiling the Digital Fingerprints: Analysis of Internet attacks based on website fingerprints
Blerim Rexha, Arbena Musa, Kamer Vishi, Edlira Martiri
TL;DR
This work studies how digital fingerprints in network traffic can reveal Tor user activity despite encryption. It introduces an end-to-end framework built on a public dataset and controlled captures, covering data collection, conversion of .pcapng captures into flow features, hyperparameter tuning via grid search, and evaluation of seven classifiers. Gradient boosting (GBM) achieves a binary-classification accuracy of $0.8363$ and random forest (RF) achieves a multi-class accuracy of $0.6297$, indicating substantial deanonymization risk even on a relatively small dataset of about $2.1\times 10^4$ samples. The results bear on the privacy guarantees of website fingerprinting (WF) defenses and motivate future work on defenses and policy measures in network privacy.
Abstract
Just as our physical activities leave traces, our virtual presence leaves behind unique digital fingerprints as we navigate the Internet. These digital fingerprints have the potential to unveil users' activities, including browsing history, the applications used, and even the devices employed. Many Internet users turn to web browsers that promise the highest privacy protection and anonymization, such as Tor. The success of this protection depends on Tor's ability to anonymize end-user IP addresses and the other metadata that make up a website fingerprint. In this paper, we show that an attacker using recent machine learning algorithms can deanonymize Tor traffic. In our experimental framework, we establish a baseline and comparative reference point using a publicly available dataset from Universidad Del Cauca, Colombia. We capture network packets over 11 days while users navigate specific web pages, recording the data in .pcapng format with the Wireshark network capture tool. After excluding extraneous packets, we apply several machine learning algorithms in our analysis. The results show that the Gradient Boosting Machine algorithm performs best in binary classification, achieving an accuracy of 0.8363. In multi-class classification, the Random Forest algorithm attains an accuracy of 0.6297.
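To make the step from captured packets to classifier input more concrete, the following is a minimal sketch of grouping packets into bidirectional flows and computing a few per-flow statistics. It is an illustration under stated assumptions, not the pipeline used in this work: the packet records, the `flow_key` and `flow_features` helpers, and the chosen feature names are all hypothetical, and a real pipeline would first parse the .pcapng files with a capture-parsing tool rather than use a hard-coded list.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical packet records (assumed layout, not taken from the paper):
# (timestamp_s, src_ip, dst_ip, src_port, dst_port, protocol, length_bytes)
PACKETS = [
    (0.00, "10.0.0.2", "192.0.2.10", 50000, 443, "TCP", 120),
    (0.05, "192.0.2.10", "10.0.0.2", 443, 50000, "TCP", 1500),
    (0.20, "10.0.0.2", "192.0.2.10", 50000, 443, "TCP", 80),
    (1.00, "10.0.0.2", "198.51.100.7", 50001, 9001, "TCP", 586),
    (1.50, "198.51.100.7", "10.0.0.2", 9001, 50001, "TCP", 586),
]


def flow_key(src_ip, dst_ip, src_port, dst_port, protocol):
    """Return a direction-independent key so both directions share one flow."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), protocol)


def flow_features(packets):
    """Group packets into bidirectional flows and compute simple statistics."""
    flows = defaultdict(list)
    for ts, src_ip, dst_ip, src_port, dst_port, proto, length in packets:
        key = flow_key(src_ip, dst_ip, src_port, dst_port, proto)
        flows[key].append((ts, length))

    features = {}
    for key, pkts in flows.items():
        pkts.sort()  # order packets by timestamp
        times = [t for t, _ in pkts]
        sizes = [s for _, s in pkts]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        features[key] = {
            "packet_count": len(pkts),
            "total_bytes": sum(sizes),
            "mean_packet_size": mean(sizes),
            "duration_s": times[-1] - times[0],
            # A single-packet flow has no gaps; use 0.0 as a placeholder.
            "mean_interarrival_s": mean(gaps) if gaps else 0.0,
        }
    return features
```

Calling `flow_features(PACKETS)` returns one feature dictionary per flow. Each dictionary could then be flattened into a row of a feature matrix for a classifier. Packet sizes and timing are used here only because they remain observable when payloads are encrypted, and the actual feature set may differ.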
