Mechanisms for Data Sharing in Collaborative Causal Inference (Extended Version)

Björn Filter; Ralf Möller; Özgür Lütfü Özçep

TL;DR

This paper tackles incentivizing data sharing for collaborative causal inference (CCI) by introducing a CPDAG-focused data valuation framework. It extends the distribution SID (dSID) to quantify how coalitions change intervention-distribution estimates and combines it with a KL-based term to form a unified valuation $v(\mathcal{E}, \mathcal{B})$, using the grand coalition as a population benchmark. Two mechanisms are proposed: a data-maximizing mechanism that sustains data production until supply just covers costs (with $\epsilon \to 0^+$), and a fair mechanism based on a $\rho$-Shapley reward that prioritizes higher-quality data while preserving fairness. The work integrates causal structure learning, multi-agent incentives, and fair allocation to enable scalable, high-quality causal estimations across self-interested participants, while noting assumptions about known costs and non-malicious behavior as avenues for future work.

Abstract

Collaborative causal inference (CCI) is a federated learning method for pooling data from multiple, often self-interested, parties, to achieve a common learning goal over causal structures, e.g. estimation and optimization of treatment variables in a medical setting. Since obtaining data can be costly for the participants and sharing unique data poses the risk of losing competitive advantages, motivating the participation of all parties through equitable rewards and incentives is necessary. This paper devises an evaluation scheme to measure the value of each party's data contribution to the common learning task, tailored to causal inference's statistical demands, by comparing completed partially directed acyclic graphs (CPDAGs) inferred from observational data contributed by the participants. The Data Valuation Scheme thus obtained can then be used to introduce mechanisms that incentivize the agents to contribute data. It can be leveraged to reward agents fairly, according to the quality of their data, or to maximize all agents' data contributions.
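To make the idea of rewarding agents according to the value of their contribution concrete, the following is a minimal sketch of a Shapley-style reward computation. Several things here are assumptions rather than the paper's definitions: the valuation `toy_v` is a hypothetical lookup table standing in for the CPDAG-based valuation, the function names are illustrative, and the scaling rule $r_i = (\phi_i / \phi^{*})^{\rho} \cdot v(N)$ is the form of $\rho$-Shapley reward commonly used in the incentive-aware collaborative learning literature, which may differ in detail from the one used in this paper.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley values: average marginal contribution over all orderings.

    `v` maps a frozenset of players to a real-valued coalition value.
    Exponential in the number of players, so only suitable for small examples.
    """
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            phi[p] += v(with_p) - v(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in phi.items()}


def rho_shapley_rewards(phi, v_grand, rho):
    """Assumed rho-Shapley form: r_i = (phi_i / max_j phi_j) ** rho * v(N).

    Requires non-negative Shapley values with a positive maximum.
    """
    phi_max = max(phi.values())
    return {p: (x / phi_max) ** rho * v_grand for p, x in phi.items()}


# Hypothetical coalition values; NOT derived from the paper's valuation scheme.
TOY_VALUES = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.4,
    frozenset({"B"}): 0.2,
    frozenset({"A", "B"}): 1.0,
}


def toy_v(coalition):
    return TOY_VALUES[frozenset(coalition)]


phi = shapley_values(["A", "B"], toy_v)
rewards = rho_shapley_rewards(phi, toy_v({"A", "B"}), rho=0.5)
```

In this toy setting agent A has the larger Shapley value and therefore receives the full grand-coalition value, while agent B receives a discounted share. Under the assumed form, a smaller $\rho$ narrows the gap between the two rewards and $\rho = 1$ makes rewards proportional to the Shapley values.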

Paper Structure (15 sections, 6 theorems, 22 equations, 1 figure, 1 algorithm)

Introduction
Preliminaries
Collaborative Causal Inference
Data Valuation Scheme
Modeling an Individual Agent
Modeling Multiple Agents
Standard Collaborative Setting
Data Maximizing Mechanism
Achieving Fairness
Related Work
Conclusion and Future Work
Distribution Structural Intervention Distance
Further Preliminaries
An Algorithm for Computing the dSID
Proof of Theorem \ref{theo1} (Data maximization with known costs)

Key Result

Theorem

The mechanism $\mathcal{M}$ defined by Equation \eqref{datamax} is data-maximizing for $\epsilon \rightarrow 0^+$. A rational agent $i$ will contribute $\Delta_i t_i^{\mathrm{opt}}$ data points, where $\Delta_i t_i^{\mathrm{opt}} \geq \Delta_i t_i^{\mathrm{s\text{-}opt}}$, yielding a total of $\sum_{j \in N} \Delta_j t_j^{\mathrm{opt}}$ data points.

Figures (1)

Figure 1: dSID

Theorems & Definitions (13)

Definition: Distribution Structural Intervention Distance (dSID)
Definition: Optimal Data Production
Definition: Data Maximization
Theorem: Data maximization with known costs
Proposition
Definition: Adjustment Criterion for DAGs (Shpitser2010)
Theorem (Shpitser2010)
Theorem
Proof
Theorem \ref{theo1}: Data maximization with known costs
...and 3 more
