WebPII: Benchmarking Visual PII Detection for Computer-Use Agents

Nathan Zhao

Abstract

Computer-use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted inference exposes user screenshots. Detecting personally identifiable information (PII) in web screenshots is critical for privacy-preserving deployment, but no public benchmark exists for this task. We introduce WebPII, a fine-grained synthetic benchmark of 44,865 annotated e-commerce UI images designed with three key properties: an extended PII taxonomy that includes transaction-level identifiers enabling reidentification, anticipatory detection for partially filled forms where users are actively entering data, and scalable generation through VLM-based UI reproduction. Experiments validate that these design choices improve layout-invariant detection across diverse interfaces and generalization to held-out page types. We train WebRedact to demonstrate practical utility, more than doubling the accuracy of a text-extraction baseline (0.753 vs. 0.357 mAP@50) at real-time CPU latency (20 ms). We release the dataset and model to support privacy-preserving computer-use research.
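
For readers unfamiliar with the metric, mAP@50 conventionally counts a predicted box as correct when its intersection over union (IoU) with a same-class ground-truth box is at least 0.5. The following is a minimal sketch of that matching criterion only, assuming boxes are given as (x1, y1, x2, y2) pixel coordinates; the function names are illustrative and are not taken from the WebPII release.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, truth_box, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count at this threshold."""
    return iou(pred_box, truth_box) >= threshold


truth = (0, 0, 10, 10)
print(is_match((2, 0, 12, 10), truth))  # small horizontal shift
print(is_match((5, 0, 15, 10), truth))  # shifted by half the box width
```

The full metric additionally ranks predictions by confidence and averages precision over classes; that part is omitted here.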

Paper Structure (44 sections, 9 figures, 16 tables)

Figures (9)

  • Figure 1: Sample images from WebPII, rendered with different injected data. The dataset captures the visual complexity of e-commerce interfaces: variable page heights reflecting diverse checkout flows and product displays (compare compact cart in (a) with extended layout in (d)), input fields and dropdown selectors, modal overlays with backdrops that occlude underlying content, ad hoc identifying information such as gift messages (c) and proposed pickup locations (e), and derived values requiring computation of taxes and totals. Bounding boxes respect occlusion boundaries. Pink indicates product annotations, purple denotes empty input fields, and red identifies PII.
  • Figure 2: Data injection maps configuration values to rendered UI elements. Left: annotated screenshot with bounding boxes. Right: selected subset of data injected for this page, comprising Faker-generated PII, ABO product data, LLM-extracted metadata, and values derived at render time. The same layout rendered with different configurations produces diverse training examples with automatic annotations.
  • Figure 3: Form fill states for anticipatory detection. (a) Partial: mid-entry state with later fields incomplete (city field shows "New M" mid-typing). (b) Empty: pristine form with placeholder text and input field annotations. Yellow indicates partially filled fields; grey denotes empty fields.
  • Figure 4: Dataset composition and statistics. (a) Distribution across form-fill variants: empty forms (13.4%), fully filled forms (22.7%), and partial-fill states (63.9%), enabling anticipatory detection training. (b) Annotation density distribution with a median of 19 boxes per image (mean 22.1), ranging from 0 to 145 annotations per image. (c) Breakdown of all 9 annotation classes, with address (25.7%), order info (23.0%), and product text (18.8%) dominating. PII classes (red) comprise 52.4% of annotations, while non-PII classes (blue) comprise 47.6%. (d) HTML element type distribution, showing that most annotations target rendered text (78.1%) rather than input fields (13.6%) or images (8.3%).
  • Figure 5: Distribution of base images across 10 e-commerce companies. Apple and Amazon have the most coverage (1,400 images each), while Slack and Ulta Beauty represent smaller verticals (300 images each). Total base images: 10,200 (multiplied across variants to produce 44,865 total dataset images).
  • ...and 4 more figures
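
The partial-fill state described in the Figure 3 caption (earlier fields complete, one field frozen mid-typing, later fields empty) can be illustrated with a short sketch. This is not the paper's generation code: the function `partial_fill`, the state labels, and the sample form values are all hypothetical, chosen only so that the city field reproduces the "New M" example from the caption.

```python
def partial_fill(fields, active_index, typed_chars):
    """Freeze an ordered form mid-entry.

    fields: ordered list of (field_name, full_value) pairs.
    active_index: position of the field currently being typed into.
    typed_chars: number of characters entered so far in that field.
    Returns a list of (field_name, shown_text, state) tuples.
    """
    snapshot = []
    for i, (name, value) in enumerate(fields):
        if i < active_index:
            snapshot.append((name, value, "filled"))  # already completed
        elif i == active_index:
            snapshot.append((name, value[:typed_chars], "partial"))  # mid-typing
        else:
            snapshot.append((name, "", "empty"))  # not yet reached
    return snapshot


# Hypothetical form contents, for illustration only.
form = [
    ("name", "Ada Lovelace"),
    ("street", "12 Elm St"),
    ("city", "New Milford"),
    ("zip", "06776"),
]

for row in partial_fill(form, active_index=2, typed_chars=5):
    print(row)
```

Sweeping `active_index` and `typed_chars` over one filled form yields many intermediate states from a single configuration, which is one plausible way to obtain the large share of partial-fill variants reported in Figure 4(a).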