Seamless Website Fingerprinting in Multiple Environments

Chuxu Song, Zining Fan, Hao Wang, Richard Martin

TL;DR

This work rethinks website fingerprinting by moving from page-level to site-level classification using a CNN (WFNet) that operates on short 500-packet segments, enabling realistic, boundary-free attacks. It emphasizes training data diversity across multiple network environments and applies domain adaptation to achieve strong generalization to unseen locations, reporting >90% accuracy across varied conditions. The study demonstrates the importance of environment-aware training and transfer learning for robust WF, while also exploring privacy defenses such as inflation and active injection, which reveal the practicality and limits of countermeasures. Overall, the results highlight a significant privacy risk in encrypted traffic and VPNs, and suggest that protocol-level obfuscation and broad, adaptive defense strategies are necessary for robust protection.

Abstract

Website fingerprinting (WF) attacks identify the websites visited over anonymized connections by using a machine learning classifier to analyze patterns in network traffic flows, such as packet sizes, directions, and inter-packet intervals. Previous studies showed that WF attacks achieve high classification accuracy. However, several issues call into question whether existing WF approaches are realizable in practice, and thus motivate a re-exploration. Because of Tor's performance issues and the resulting poor browsing experience, the vast majority of users opt for Virtual Private Networks (VPNs) despite VPNs' weaker privacy protections. Many other past assumptions are increasingly unrealistic as web technology advances. Our work addresses several key limitations of prior art. First, we introduce a new approach that classifies entire websites rather than individual web pages. Site-level classification uses traffic from all site components, including advertisements, multimedia, and single-page applications. Second, our Convolutional Neural Network (CNN) uses only the jitter and size of 500 contiguous packets taken from any point in a TCP stream, in contrast to prior work that requires heuristics to find page boundaries. This seamless approach makes eavesdropper attack models realistic. Using traces from a controlled browser, we show that our CNN matches observed traffic to a website with over 90% accuracy. We found that the quality of the training traffic is critical: classification accuracy drops significantly when the training data lacks variability in network location, performance, and clients' computational capability. We improved the base CNN's efficacy with domain adaptation, which allows it to discount irrelevant features such as network location. Lastly, we evaluate several defensive strategies against seamless WF attacks.
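To make the "500 contiguous packets from any point in a TCP stream" idea concrete, the following is a minimal sketch of how such a segment might be cut from a captured trace. It is not the authors' code. The `(timestamp, size)` packet layout, the function names `extract_segment` and `random_segment`, the reading of "jitter" as the gap between consecutive packet timestamps, and zero-padding of short traces are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

SEGMENT_LEN = 500  # packets per segment, as stated in the abstract

# Hypothetical packet layout: (timestamp in seconds, size in bytes).
Packet = Tuple[float, int]
Feature = Tuple[float, int]  # (gap to previous packet, size)


def extract_segment(packets: Sequence[Packet], start: int,
                    length: int = SEGMENT_LEN) -> List[Feature]:
    """Return `length` (inter-packet gap, size) pairs beginning at `start`.

    No page-boundary detection is involved: any start index is valid.
    Traces shorter than `length` are zero-padded (an assumption here).
    """
    window = packets[start:start + length]
    features: List[Feature] = []
    prev_ts = window[0][0] if window else 0.0
    for ts, size in window:
        features.append((ts - prev_ts, size))  # first gap is 0.0 by construction
        prev_ts = ts
    features.extend([(0.0, 0)] * (length - len(features)))
    return features


def random_segment(packets: Sequence[Packet], length: int = SEGMENT_LEN,
                   rng: Optional[random.Random] = None) -> List[Feature]:
    """Cut a segment at a uniformly random offset within the trace."""
    rng = rng or random.Random()
    last_start = max(0, len(packets) - length)
    return extract_segment(packets, rng.randint(0, last_start), length)
```

Each resulting segment is a fixed-length sequence of two features per packet, which is the kind of input a 1-D CNN can consume; the actual WFNet input encoding and normalization may differ.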

Paper Structure (27 sections, 1 equation, 9 figures, 9 tables)

Figures (9)

  • Figure 1:
  • Figure 2: Neural Network Structure Implementing Domain Adaptation.
  • Figure 3: Structure of the WFNet-Base.
  • Figure 4: This figure illustrates the distribution of padding zeros across each class in the AWF775 dataset. Labels are ordered by the average padding zeros per instance, sorted in descending order based on the data volume generated by each website. The blue line indicates the mean number of padding zeros, while the shaded area represents the standard deviation.
  • Figure 5: Accuracy of training on one location and testing on that location. Each location has 22 websites.
  • ...and 4 more figures