Mechanisms for Data Sharing in Collaborative Causal Inference (Extended Version)
Björn Filter, Ralf Möller, Özgür Lütfü Özçep
TL;DR
This paper tackles the problem of incentivizing data sharing for collaborative causal inference (CCI) by introducing a CPDAG-focused data valuation framework. It extends the distribution SID (dSID) to quantify how coalitions change intervention-distribution estimates and combines it with a KL-based term to form a unified valuation $v(\mathcal{E}, \mathcal{B})$, using the grand coalition as a population benchmark. Two mechanisms are proposed: a data-maximizing mechanism that sustains data production until supply just covers costs (with $\epsilon \to 0^+$), and a fair mechanism based on a $\rho$-Shapley reward that prioritizes higher-quality data while preserving fairness. The work integrates causal structure learning, multi-agent incentives, and fair allocation to enable scalable, high-quality causal estimates across self-interested participants. It assumes known costs and non-malicious behavior; relaxing these assumptions is left to future work.
Abstract
Collaborative causal inference (CCI) is a federated learning method for pooling data from multiple, often self-interested, parties to achieve a common learning goal over causal structures, e.g., the estimation and optimization of treatment variables in a medical setting. Since obtaining data can be costly for the participants, and sharing unique data poses the risk of losing competitive advantages, it is necessary to motivate all parties to participate through equitable rewards and incentives. This paper devises an evaluation scheme that measures the value of each party's data contribution to the common learning task. The scheme is tailored to the statistical demands of causal inference and compares completed partially directed acyclic graphs (CPDAGs) inferred from the observational data contributed by the participants. The resulting data valuation scheme can then be used to introduce mechanisms that incentivize the agents to contribute data. It can be leveraged either to reward agents fairly, according to the quality of their data, or to maximize all agents' data contributions.
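
As a rough illustration of rewarding agents according to their contribution, the sketch below computes exact Shapley values for a small coalition game. It is a minimal sketch under stated assumptions: `toy_value` and its quality scores are hypothetical stand-ins for the CPDAG-based valuation, and the code implements the standard Shapley value rather than the $\rho$-weighted variant.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a coalition game.

    `value` maps a frozenset of players to a real number.
    The cost is exponential in the number of players.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Probability weight of a coalition of size k preceding p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def toy_value(coalition):
    """Hypothetical valuation: pooled data quality with diminishing returns."""
    quality = {"A": 0.5, "B": 0.3, "C": 0.2}  # assumed scores, not from the paper
    return sum(quality[p] for p in coalition) ** 0.5


phi = shapley_values(["A", "B", "C"], toy_value)
```

In this toy game the shares sum to the value of the grand coalition, and agents with higher assumed quality receive larger shares.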
