Data Guard: A Fine-grained Purpose-based Access Control System for Large Data Warehouses

Khai Tran, Sudarshan Vasudevan, Pratham Desai, Alex Gorelik, Mayank Ahuja, Athrey Yadatore Venkateshababu, Mohit Verma, Dichao Hu, Walaa Eldin Moustafa, Vasanth Rajamani, Ankit Gupta, Issac Buenrostro, Kalinda Raina

TL;DR

Data Guard addresses the challenge of compliant data usage in large data warehouses by implementing fine-grained, purpose-based access control that masks data at sub-cell levels based on semantic data labels. It uses a domain-specific policy language to define rules, compiles these into purpose-specific, schema-preserving data-masking views, and transparently routes accesses to the appropriate views via a routing layer called ViewShift. The system is engine-agnostic and tested on Spark and Trino, with production deployment at LinkedIn showing manageable overhead (average ~13.2% CPU per application) across thousands of queries, and large-scale policy coverage (50+ purposes, 110+ policies across 4k tables and 13k views). Data Guard’s practical contributions include the policy compiler, consent-bitmaps optimization, sub-cell masking via a field-path DSL, and cross-engine support, enabling compliant analytics while preserving data utility.

Abstract

The last few years have witnessed a spate of data protection regulations in conjunction with an ever-growing appetite for data usage in large businesses, which presents significant challenges for businesses to maintain compliance. To address this conflict, we present Data Guard - a fine-grained, purpose-based access control system for large data warehouses. Data Guard enables authoring policies based on semantic descriptions of data and purpose of data access. Data Guard then translates these policies into SQL views that mask data from the underlying warehouse tables. At access time, Data Guard ensures compliance by transparently routing each table access to the appropriate data-masking view based on the purpose of the access, thus minimizing the effort of adopting Data Guard in existing applications. Our enforcement solution allows masking data at much finer granularities than what traditional solutions allow. In addition to row and column level data masking, Data Guard can mask data at the sub-cell level for columns with non-atomic data types such as structs, arrays, and maps. This fine-grained masking allows Data Guard to preserve data utility for consumers while ensuring compliance. We implemented a number of performance optimizations to minimize the overhead of data masking operations. We perform numerous experiments to identify the key factors that influence the data masking overhead and demonstrate the efficiency of our implementation. Data Guard is deployed inside LinkedIn's production data warehouses and ensures compliance of more than 20,000 table accesses each day across different data processing engines.
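
The two mechanisms the abstract describes, compiling a policy into a data-masking view and routing a table access to that view by purpose, can be illustrated with a minimal sketch. This is not Data Guard's policy language or the ViewShift implementation. The `MaskRule` class, the `compile_view` and `route` functions, the `table__purpose` view naming, and the boolean consent column are all hypothetical, and the sketch covers column-level masking only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskRule:
    """Hypothetical rule: for `purpose`, null out `column` unless `consent_column` is true."""
    purpose: str
    column: str
    consent_column: str


def compile_view(table: str, columns: list[str], rules: list[MaskRule], purpose: str) -> str:
    """Build a schema-preserving masking view (same column names, same order) for one purpose."""
    masked = {r.column: r for r in rules if r.purpose == purpose}
    select_items = []
    for col in columns:
        rule = masked.get(col)
        if rule is None:
            # No rule for this column under this purpose: pass it through unchanged.
            select_items.append(col)
        else:
            # Keep the column in the output schema, but mask its value without consent.
            select_items.append(
                f"CASE WHEN {rule.consent_column} THEN {col} ELSE NULL END AS {col}"
            )
    view_name = f"{table}__{purpose}"  # assumed naming convention
    return f"CREATE VIEW {view_name} AS SELECT {', '.join(select_items)} FROM {table}"


def route(table: str, purpose: str, views: dict[tuple[str, str], str]) -> str:
    """Return the view registered for (table, purpose).

    Falling back to the base table when no view exists is an assumption of this
    sketch, not documented Data Guard behavior.
    """
    return views.get((table, purpose), table)


rules = [MaskRule(purpose="ads", column="email", consent_column="ads_consent")]
print(compile_view("members", ["id", "email"], rules, "ads"))
print(route("members", "ads", {("members", "ads"): "members__ads"}))
```

Because the view keeps the table's column names and order, a query written against `members` would run unchanged against `members__ads`, which is what makes transparent routing possible.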

Paper Structure

This paper contains 33 sections, 2 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: An example highlighting dynamic masking of data to honor member preferences based on purpose yielding different outputs.
  • Figure 2: Data Guard System Architecture.
  • Figure 3: A data-masking view for the ads purpose.
  • Figure 4: Example of a nested relation with arrays, structs, maps.
  • Figure 5: A schema tree constructed from field paths in Table \ref{tab:optimization}.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2