Programmable Dataflows: Abstraction and Programming Model for Data Sharing

Siyuan Xia, Chris Zhu, Tapan Srivastava, Bridget Fahey, Raul Castro Fernandez

TL;DR

This work addresses the absence of abstractions in data sharing by introducing programmable dataflows built on a contract abstraction and a Contract Programming Model (CPM). It formalizes data sharing as a sequence of dataflows between agents, with goals and constraints, and allows pre-execution evaluation of dataflows through contracts that specify what data is shared, who approves, and under what conditions. CPM enables developers to implement complex sharing problems using annotated functions, supports delegated execution via data escrows, and includes optimizations such as saving intermediate data elements and short-circuiting to improve efficiency. The evaluation demonstrates the expressiveness of the contract abstraction across canonical sharing patterns, the practical effectiveness of the CPM over alternative technologies, and significant performance gains from the proposed optimizations, highlighting the potential for scalable, compliant data sharing in fraud detection, health data sharing, and ad matching.

Abstract

Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. The lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. In this paper, we first introduce a data sharing model that represents every data sharing problem as a sequence of dataflows. From the model, we distill an abstraction, the contract, which agents use to communicate the intent of a dataflow and evaluate its consequences before the dataflow takes place. This helps agents move towards a common sharing goal without violating any regulatory or privacy constraints. Then, we design and implement the contract programming model (CPM), which allows agents to program data sharing applications tailored to each problem's needs. Contracts permit data sharing, but their interactive nature may introduce inefficiencies. To mitigate those inefficiencies, we extend the CPM so that it can save intermediate outputs of dataflows and skip computation if a dataflow tries to access data that it is not permitted to access. In our evaluation, we show that 1) the contract abstraction is general enough to represent a wide range of sharing problems, 2) we can write programs for complex data sharing problems and exhibit qualitative improvements over alternative technologies, and 3) quantitatively, our optimizations make sharing programs written with the CPM efficient.
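To make the ideas in the abstract concrete, the following is a minimal sketch of a contract that must be approved before a dataflow runs, together with the two optimizations mentioned above (saving intermediate outputs and skipping computation on unauthorized access). It is not the CPM's actual API: the names `Contract`, `Escrow`, `AccessDenied`, `run`, and the `reads` counter are hypothetical, and the approval rule (every listed approver must approve) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Raised when a dataflow is not covered by an approved contract."""


@dataclass
class Contract:
    """Hypothetical contract: which function may run over which data elements."""
    function_name: str
    data_elements: frozenset   # names of the data elements the function may read
    approvers: frozenset       # agents who must all approve (assumed rule)
    approvals: set = field(default_factory=set)

    def approve(self, agent):
        if agent not in self.approvers:
            raise ValueError(f"{agent} is not an approver of this contract")
        self.approvals.add(agent)

    def is_approved(self):
        return self.approvals == set(self.approvers)


class Escrow:
    """Hypothetical escrow that runs functions on behalf of agents."""

    def __init__(self, store):
        self.store = store     # data element name -> data
        self.cache = {}        # saved intermediate outputs
        self.reads = 0         # number of times a function actually executed

    def run(self, contract, func, requested):
        requested = frozenset(requested)
        # Short-circuit: refuse before doing any work if access is not allowed.
        if not contract.is_approved():
            raise AccessDenied("contract has not been approved by all approvers")
        if not requested <= contract.data_elements:
            raise AccessDenied("dataflow requests data outside the contract")
        # Reuse a saved output if the same function ran over the same data.
        key = (contract.function_name, requested)
        if key in self.cache:
            return self.cache[key]
        self.reads += 1
        result = func(*(self.store[name] for name in sorted(requested)))
        self.cache[key] = result
        return result


def shared_accounts(a, b):
    """Toy function: account IDs present in both inputs."""
    return sorted(set(a) & set(b))


escrow = Escrow({"bank_a": [1, 2, 3], "bank_b": [3, 4]})
contract = Contract(
    "shared_accounts",
    frozenset({"bank_a", "bank_b"}),
    frozenset({"A", "B"}),
)
contract.approve("A")
contract.approve("B")
first = escrow.run(contract, shared_accounts, ["bank_a", "bank_b"])
second = escrow.run(contract, shared_accounts, ["bank_a", "bank_b"])  # served from cache
```

In this sketch the access check happens before any data is read, so a rejected dataflow costs no computation, and a repeated dataflow over the same data elements is answered from the saved output rather than recomputed.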

Paper Structure (24 sections, 3 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Illustration of three consecutive data sharing states.
  • Figure 2: Examples of how contracts enable data sharing in the fraud detection example.
  • Figure 3: Sharing states separated by the contract execution.
  • Figure 4: Program snippet for healthcare sharing. upload_data_with_CPR ensures all agents' data includes a "CPR" column. run_causal_query uses Python's dowhy package to calculate the causal effect. upload_cmr uses the default implementation from the CPM. DNPR calls this to automatically allow agents to run run_causal_query over their data.
  • Figure 5: Program snippet for ad matching. propose_contract and approve_contract allow contracts to be proposed and approved one at a time. train_advertising_model allows reusing the saved joined result from Facebook and YouTube's data.
  • ...and 2 more figures

Theorems & Definitions (9)

  • Example 1: Financial Fraud Detection
  • Definition 1: Agent
  • Definition 2: Data Element
  • Definition 3: Data Sharing State
  • Definition 4: Goal States
  • Definition 5: Constraint State
  • Definition 6: Dataflow
  • Definition 7: Data Sharing Goal
  • Definition 8: Contract