Browsing behavior exposes identities on the Web

Marcos Oliveira; Junran Yang; Daniel Griffiths; Denis Bonnay; Juhi Kulshrestha

TL;DR

When people navigate the Web, their browsing traces produce fingerprints that identify them. Perhaps the most basic feature of their online habits, namely which sites they visit most often, is highly unique.

Abstract

How easy is it to uniquely identify a person based solely on their web browsing behavior? Here we show that when people navigate the Web, their online traces produce fingerprints that identify them. Merely the four most visited web domains are enough to identify 95% of the individuals. These digital fingerprints are stable and make individuals highly re-identifiable: we demonstrate that we can re-identify 80% of the individuals in separate time slices of data. Such a privacy threat persists even with limited information about individuals' browsing behavior, reinforcing existing concerns around online privacy.
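To make the notion of a fingerprint concrete, the following is a minimal sketch of how the share of users with a unique top-k fingerprint could be computed. It is not the authors' code: the function names, the toy data, and the choice to treat a fingerprint as an unordered set of domains (with ties broken alphabetically) are all assumptions made for illustration.

```python
from collections import Counter


def top_k_fingerprint(visits, k):
    """Return the k most-visited domains as a frozenset.

    Assumption: the fingerprint ignores the order of the domains.
    Ties in visit counts are broken alphabetically for determinism.
    """
    counts = Counter(visits)
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    return frozenset(ranked[:k])


def uniqueness_rate(user_visits, k):
    """Fraction of users whose top-k fingerprint is shared with nobody else."""
    fingerprints = {u: top_k_fingerprint(v, k) for u, v in user_visits.items()}
    freq = Counter(fingerprints.values())
    unique = sum(1 for fp in fingerprints.values() if freq[fp] == 1)
    return unique / len(fingerprints)


# Hypothetical toy data: one list of visited domains per user.
toy_visits = {
    "u1": ["a.com"] * 5 + ["b.com"] * 3 + ["c.com"],
    "u2": ["a.com"] * 4 + ["b.com"] * 2 + ["d.com"],
    "u3": ["a.com"] * 6 + ["e.com"] * 2 + ["b.com"],
}

for k in (1, 2, 3):
    print(k, uniqueness_rate(toy_visits, k))
```

In this toy data all three users share the same single most-visited domain, so no one is unique at k = 1, while every user is unique at k = 3. The same pattern of uniqueness rising with fingerprint length is what the abstract reports at scale.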

Paper Structure (13 sections, 1 figure)

This paper contains 13 sections, 1 figure.

Introduction
Results
Discussion
Methods
Data Availability
Ethics declarations
Acknowledgments

Figures (1)

Figure 1: People’s habitually visited websites serve as fingerprints that distinguish them.
(a) The percentage of users with a unique list of most-visited domains (i.e., fingerprint) for varying fingerprint lengths. Almost all users have unique four-domain fingerprints (i.e., four most visited domains).
(b) The percentage of users with unique four-domain fingerprints, grouped by age and gender. Regardless of gender and age, a four-domain fingerprint yields high uniqueness.
(c) A schematic of step-by-step user identification via users' habitual websites. For each user, we randomly select a domain from their list of five most-visited domains, then group all users sharing the same domain. Selecting additional domains progressively refines the groups until the user is uniquely isolated. In the illustration, bar sizes are proportional to the number of users.
(d) The distribution of the number of steps needed to identify a user within our data. While four domains ensure uniqueness, fewer domains are often enough for identification; by following the steps in (c), we need an average of $2.45$ steps to identify users.
(e) The re-identification rate for different durations of data collection. The fingerprints enable re-identification of most individuals in separate time slices of data.
(f) The percentage of unique users when fewer domains are tracked. Even when data is collected from only a limited number of domains, the majority of users can still be distinguished.
(g) The percentage of unique users when fewer domains are tracked, with fixed fingerprint lengths. Collecting data from more domains yields higher uniqueness, but with diminishing returns.
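The step-by-step identification described in panel (c) can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the function name, the toy data, and the representation of each user's most-visited domains as a set are assumptions.

```python
import random


def steps_to_identify(target, top_domains, rng):
    """Count how many randomly drawn domains isolate `target`.

    top_domains maps each user to the set of their most-visited domains
    (five per user in the caption; fewer in the toy data below).
    Returns the number of steps, or None if the user is never isolated.
    """
    candidates = set(top_domains)
    # Sort before shuffling so a seeded rng gives reproducible draws.
    order = sorted(top_domains[target])
    rng.shuffle(order)
    for step, domain in enumerate(order, start=1):
        # Keep only users who also have this domain in their list.
        candidates = {u for u in candidates if domain in top_domains[u]}
        if candidates == {target}:
            return step
    return None


# Hypothetical toy data: u2 and u4 share identical lists.
toy_top = {
    "u1": {"a.com", "b.com", "c.com"},
    "u2": {"a.com", "b.com", "d.com"},
    "u3": {"x.com", "y.com", "z.com"},
    "u4": {"a.com", "b.com", "d.com"},
}

rng = random.Random(0)
for user in sorted(toy_top):
    print(user, steps_to_identify(user, toy_top, rng))
```

Averaging the returned step counts over all users who can be isolated would give a quantity analogous to the average number of steps reported in panel (d), under the assumptions stated above.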
