Programmable Dataflows: Abstraction and Programming Model for Data Sharing
Siyuan Xia, Chris Zhu, Tapan Srivastava, Bridget Fahey, Raul Castro Fernandez
TL;DR
This work addresses the absence of abstractions in data sharing by introducing programmable dataflows built on a contract abstraction and a Contract Programming Model (CPM). It formalizes data sharing as a sequence of dataflows between agents, with goals and constraints, and allows pre-execution evaluation of dataflows through contracts that specify what data is shared, who approves, and under what conditions. CPM enables developers to implement complex sharing problems using annotated functions, supports delegated execution via data escrows, and includes optimizations such as saving intermediate data elements and short-circuiting to improve efficiency. The evaluation demonstrates the expressiveness of the contract abstraction across canonical sharing patterns, the practical advantages of the CPM over alternative technologies, and significant performance gains from the proposed optimizations. These results highlight the potential for scalable, compliant data sharing in applications such as fraud detection, health data sharing, and ad matching.
Abstract
Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. The lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. In this paper, we first introduce a data sharing model that represents every data sharing problem as a sequence of dataflows. From the model, we distill an abstraction, the contract, which agents use to communicate the intent of a dataflow and evaluate its consequences before the dataflow takes place. This helps agents move towards a common sharing goal without violating any regulatory or privacy constraints. Then, we design and implement the contract programming model (CPM), which allows agents to program data sharing applications tailored to each problem's needs. Contracts permit data sharing, but their interactive nature may introduce inefficiencies. To mitigate those inefficiencies, we extend the CPM so that it can save intermediate outputs of dataflows and skip computation when a dataflow tries to access data that it is not permitted to access. In our evaluation, we show that 1) the contract abstraction is general enough to represent a wide range of sharing problems, 2) we can write programs for complex data sharing problems that exhibit qualitative improvements over alternative technologies, and 3) quantitatively, our optimizations make sharing programs written with the CPM efficient.
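To make the contract idea concrete, the following is a minimal Python sketch of how a contract might gate a dataflow before it runs. It is not the paper's CPM API: the names `Contract`, `Escrow`, `propose`, `run`, and `shared_accounts` are hypothetical, and the caching and access check are simplified stand-ins for the two optimizations the abstract mentions (saving intermediate outputs and skipping computation on disallowed access).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, FrozenSet, Iterable, List, Optional, Set, Tuple


@dataclass
class Contract:
    """Hypothetical contract: which function may read which data, and who must approve."""
    function: str
    data_elements: FrozenSet[str]
    approvers: FrozenSet[str]
    approved_by: Set[str] = field(default_factory=set)

    def approve(self, agent: str) -> None:
        if agent not in self.approvers:
            raise ValueError(f"{agent} is not an approver of this contract")
        self.approved_by.add(agent)

    def is_active(self) -> bool:
        # The dataflow is permitted only once every listed approver has agreed.
        return self.approved_by == set(self.approvers)


class Escrow:
    """Toy stand-in for a data escrow that executes dataflows on behalf of agents."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}
        self.contracts: List[Contract] = []
        self.cache: Dict[Tuple[str, FrozenSet[str]], Any] = {}
        self.executions = 0  # counts how often a function body actually ran

    def register(self, name: str, value: Any) -> None:
        self.data[name] = value

    def propose(self, contract: Contract) -> None:
        self.contracts.append(contract)

    def run(self, function_name: str, fn: Callable[..., Any],
            data_elements: Iterable[str]) -> Optional[Any]:
        needed = frozenset(data_elements)
        # Simplified short-circuit: skip the computation entirely when no
        # approved contract covers the requested data elements.
        allowed = any(
            c.is_active() and c.function == function_name and needed <= c.data_elements
            for c in self.contracts
        )
        if not allowed:
            return None
        # Simplified reuse of a saved output (assumes registered data is unchanged).
        key = (function_name, needed)
        if key not in self.cache:
            self.executions += 1
            self.cache[key] = fn(*(self.data[name] for name in sorted(needed)))
        return self.cache[key]


def shared_accounts(accounts_a: List[str], accounts_b: List[str]) -> List[str]:
    """Example dataflow: accounts flagged by both parties."""
    return sorted(set(accounts_a) & set(accounts_b))


if __name__ == "__main__":
    escrow = Escrow()
    escrow.register("bank_a_accounts", ["acct1", "acct2", "acct3"])
    escrow.register("bank_b_accounts", ["acct2", "acct3", "acct4"])
    names = ["bank_a_accounts", "bank_b_accounts"]
    contract = Contract("shared_accounts", frozenset(names), frozenset({"bank_a", "bank_b"}))
    escrow.propose(contract)
    print(escrow.run("shared_accounts", shared_accounts, names))  # not yet approved
    contract.approve("bank_a")
    contract.approve("bank_b")
    print(escrow.run("shared_accounts", shared_accounts, names))
```

In this sketch, approval happens before any data is touched, which mirrors the abstract's point that agents evaluate a dataflow's consequences before it takes place. A real system would also need to invalidate saved outputs when the underlying data changes, which is omitted here.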
