
SINBAD: Saliency-informed detection of breakage caused by ad blocking

Saiid El Hajj Chehade, Sandra Siby, Carmela Troncoso

TL;DR

SINBAD detects website breakage caused by the filter lists that privacy-enhancing blocking tools rely on. It trains on user-reported issues and uses web saliency to drive targeted interactions. The method combines three innovations: forum-derived ground truth, saliency-informed crawling, and subtree-focused differential analysis. Together they detect breakage, including dynamic and CSS-based cases, with a reported 20% accuracy improvement over prior work. The paper reports high discrimination at the subtree level, evaluation across multiple datasets, and strong generalization in open-world tests. The practical result is a proactive tool that lets maintainers test new rules before deployment, reducing user friction and improving the reliability of blocking tools.

Abstract

Privacy-enhancing blocking tools based on filter-list rules tend to break legitimate functionality. Filter-list maintainers could benefit from automated breakage detection tools that allow them to proactively fix problematic rules before deploying them to millions of users. We introduce SINBAD, an automated breakage detector that improves the accuracy over the state of the art by 20%, and is the first to detect dynamic breakage and breakage caused by style-oriented filter rules. The success of SINBAD is rooted in three innovations: (1) the use of user-reported breakage issues in forums that enable the creation of a high-quality dataset for training in which only breakage that users perceive as an issue is included; (2) the use of 'web saliency' to automatically identify user-relevant regions of a website on which to prioritize automated interactions aimed at triggering breakage; and (3) the analysis of webpages via subtrees which enables fine-grained identification of problematic filter rules.

Paper Structure (40 sections, 5 equations, 13 figures, 6 tables)


Figures (13)

  • Figure 1: Number of issues that we can process and that have live test URLs, annotated with their share of all issues created in that year.
  • Figure 2: Plot showing the number of reproducible issues according to the duration between the time when the issue was created and when we evaluated whether it can be reproduced.
  • Figure 3: Overview of SINBAD. The pipeline consists of three phases: (1) Saliency-informed crawling: SINBAD detects salient elements on the page, and runs three crawls -- with no filter lists, broken filter lists, and fixed filter lists. Crawls execute interactions with salient elements to trigger dynamic breakage. (2) Differential subtree creation: SINBAD uses the changes in the page's DOM tree between pairs of crawls to create differential subtrees. (3) Subtree classification: SINBAD extracts features and labels from the subtrees to train a classifier that can classify subtrees as broken or not.
  • Figure 4: Subtree extraction example for the difference going from visit $A$ (top left) to visit $B$ (bottom left). We can see the common tree in blue (top right) $T_{A,B}$ and the differential subtrees set $\Delta_{A,B}$ (bottom right).
  • Figure 5: Decision tree to label a subtree given the ground-truth origin of the visits ($C_F$: visit with fixing filter list, $C_B$: visit with breaking filter list, and $C_N$: visit with no filter lists). The labeling also depends on what happened to the subtree between the two visits in question (ADDED, REMOVED, or EDITED).
  • ...and 8 more figures
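
The captions of Figures 3 to 5 describe comparing the DOM trees of two visits and collecting the subtrees that were ADDED, REMOVED, or EDITED. The Python sketch below illustrates that idea on a toy tree. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, the `keyed_children` and `diff_subtrees` helpers, and the rule for matching siblings (tag, `id` attribute, and occurrence index) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy DOM node (hypothetical; real crawls would use browser DOM snapshots)."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def keyed_children(node):
    """Map each child to a key (tag, id, occurrence index).

    Assumption: siblings are matched across visits by this key. The paper's
    actual matching rule is not given in this section.
    """
    seen, out = {}, {}
    for child in node.children:
        base = (child.tag, child.attrs.get("id"))
        n = seen.get(base, 0)
        seen[base] = n + 1
        out[base + (n,)] = child
    return out


def diff_subtrees(a, b, path=""):
    """Return [(status, path, root)] for changes going from visit a to visit b.

    a and b are assumed to be already-matched nodes. A node whose attributes
    differ is reported as EDITED, and its children are still compared.
    """
    here = f"{path}/{a.tag}"
    out = []
    if a.attrs != b.attrs:
        out.append(("EDITED", here, b))
    a_kids, b_kids = keyed_children(a), keyed_children(b)
    for key, child in a_kids.items():
        if key not in b_kids:
            out.append(("REMOVED", f"{here}/{child.tag}", child))
    for key, child in b_kids.items():
        if key not in a_kids:
            out.append(("ADDED", f"{here}/{child.tag}", child))
        else:
            out.extend(diff_subtrees(a_kids[key], child, here))
    return out


# Visit A: no filter list. Visit B: a filter list hides the ad and,
# as collateral damage, the play button.
visit_a = Node("body", {}, [
    Node("div", {"id": "player"}, [Node("button", {"id": "play"})]),
    Node("div", {"id": "ad"}),
])
visit_b = Node("body", {}, [
    Node("div", {"id": "player"}),
])

changes = diff_subtrees(visit_a, visit_b)
for status, where, root in changes:
    print(status, where, root.attrs.get("id"))
```

In this toy case both changes are removals: the ad container, which is the intended effect of blocking, and the play button, which is the kind of change a classifier would need to flag as breakage. Telling the two apart is the job of the labeling and classification stages described in Figures 3 and 5, which this sketch does not attempt.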