Beyond the Crawl: Unmasking Browser Fingerprinting in Real User Interactions

Meenatchi Sundaram Muthu Selva Annamalai, Igor Bilogrevic, Emiliano De Cristofaro

TL;DR

The study shows that automated crawls substantially underestimate the browser fingerprinting that real users encounter: crawls miss 45% of the fingerprinting websites. It collects real-user telemetry from 30 participants across 3,000 top-ranked sites over 10 weeks and compares it to an automated crawl, revealing higher fingerprinting prevalence and additional vectors (e.g., AudioContext, WebRTC) in the real-user data. Using differentially private federated learning (DP-FL) through FP-Fed, the authors train privacy-preserving detectors that match or surpass crawl-based models, achieving $\mathrm{AUPRC}=0.98$ at $\varepsilon=1$ on real-user data. The results highlight the need for hybrid, privacy-preserving approaches that integrate real-user data to build robust browser fingerprinting defenses, with practical implications for browser deployment.

Abstract

Browser fingerprinting is a pervasive online tracking technique used increasingly often for profiling and targeted advertising. Prior research on the prevalence of fingerprinting has relied heavily on automated web crawls, which inherently struggle to replicate the nuances of human-computer interaction. This raises concerns about how accurately real-world fingerprinting deployments are currently understood. To address this gap, this paper presents a user study involving 30 participants over 10 weeks, capturing telemetry data from real browsing sessions across 3,000 top-ranked websites. Our evaluation reveals that automated crawls miss almost half (45%) of the fingerprinting websites encountered by real users. This discrepancy mainly stems from the crawlers' inability to access authentication-protected pages, circumvent bot detection, and trigger fingerprinting scripts activated by specific user interactions. We also identify potential new fingerprinting vectors present in real user data but absent from automated crawls. Finally, we evaluate the effectiveness of federated learning for training browser fingerprinting detection models on real user data, which yields better performance than models trained solely on automated crawl data.

Paper Structure

This paper contains 21 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Distribution of unique URLs visited by each study participant.
  • Figure 2: Percentage of fingerprinting websites for each website category.
  • Figure 3: Percentage of fingerprinting websites undetected by the automated crawler in each website category.
  • Figure 4: An overview of FP-Fed [annamalai2024fp]: (1) Participants browse websites and collect script execution data. (2) For each round of training, the server sends the previous round's global model parameters to a few selected participants. (3) These selected participants train a local model based on the collected local script execution data, and (4) send the local model updates to the server, (5) which aggregates them and adds differentially private noise to the aggregate to form the global model of the current round. Steps (2) to (5) are then repeated until the model converges.
  • Figure 5: HIT uploaded to MTurk. To ease readability, a significant portion of the consent form has been left out.
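
The training loop described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the FP-Fed implementation: the logistic-regression local model, the per-update norm clipping (a common way to bound each participant's contribution before adding noise), and every function name and parameter here are assumptions introduced for the example. It also performs no privacy accounting, so it does not by itself establish any particular $\varepsilon$.

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=5):
    """Step (3): train a local logistic-regression model (assumed stand-in)
    starting from the global weights, and return the weight delta."""
    w = global_w.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (preds - y) / len(y)
        w -= lr * grad
    return w - global_w

def clip(delta, clip_norm):
    """Scale the update so its L2 norm is at most clip_norm (assumed step)."""
    norm = np.linalg.norm(delta)
    return delta * min(1.0, clip_norm / (norm + 1e-12))

def dp_fedavg_round(global_w, clients, rng, sample_size, clip_norm, noise_multiplier):
    """Steps (2)-(5): sample participants, collect clipped local updates,
    sum them, add Gaussian noise to the aggregate, and update the model."""
    chosen = rng.choice(len(clients), size=sample_size, replace=False)
    total = np.zeros_like(global_w)
    for i in chosen:
        X, y = clients[i]
        total += clip(local_update(global_w, X, y), clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=global_w.shape)
    return global_w + (total + noise) / sample_size

# Toy demo with synthetic data (hypothetical; not the study's telemetry).
demo_rng = np.random.default_rng(0)
demo_clients = []
for _ in range(10):
    X = demo_rng.normal(size=(20, 3))
    y = (X[:, 0] > 0).astype(float)  # synthetic "fingerprinting" label
    demo_clients.append((X, y))

w = np.zeros(3)
for _ in range(5):  # fixed number of rounds instead of a convergence check
    w = dp_fedavg_round(w, demo_clients, demo_rng,
                        sample_size=4, clip_norm=1.0, noise_multiplier=0.5)
```

Clipping before aggregation means the noise scale can be set relative to a known bound on any single participant's influence, which is why the noise standard deviation above is expressed as a multiple of `clip_norm`.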

Theorems & Definitions (1)

  • Definition 1: Differential Privacy (DP)
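
For reference, the standard $(\varepsilon, \delta)$ formulation of differential privacy is given below. The paper's exact statement of Definition 1 is not reproduced in this outline, so the notation here ($\mathcal{M}$, $D$, $D'$, $S$, $\delta$) is assumed and may differ from the authors' version. A randomized mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-DP if, for all adjacent datasets $D$ and $D'$ and every set of outputs $S \subseteq \mathrm{Range}(\mathcal{M})$,

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
$$

Smaller $\varepsilon$ means the output distribution changes less when one participant's data is added or removed, so the $\varepsilon = 1$ setting quoted in the TL;DR corresponds to a multiplicative bound of $e^{1} \approx 2.72$ on that change, up to the additive slack $\delta$.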