Identifying the potential of sample overlap in evidence synthesis of observational studies

Zhentian Zhang; Tim Friede; Tim Mathes

TL;DR

This work develops a method for constructing indicators of the degree of sample overlap in evidence syntheses of studies based on existing data. The method is rooted in set theory and codes the ranges of several well-selected sample characteristics.

Abstract

Sample overlap is a common issue in evidence synthesis in medical research, particularly when integrating findings from observational studies that use existing databases such as registries. Because unique identifiers for each observation are generally inaccessible, sample overlap has been difficult to address; it can bias the results of an evidence synthesis and undermine their credibility. We developed a method to construct indicators for the degree of sample overlap in evidence synthesis of studies based on existing data. The method is rooted in set theory and is based on coding the ranges of several well-selected sample characteristics. It offers a practical solution by drawing inference from sample characteristics rather than from individual participant data. Useful information, such as the overlap-free sample set with the largest sample size in an evidence synthesis, can be derived from this method. We applied our method to several real-world evidence syntheses, demonstrating its effectiveness and flexibility. Our findings highlight the growing importance of addressing sample overlap in evidence synthesis, especially given the increasing relevance of secondary use of data, an area that is currently under-explored.

Paper Structure (49 sections, 5 theorems, 30 equations, 14 figures, 6 tables)

Introduction
Theoretical approach to identify sample overlap in evidence synthesis
Preliminaries
Overlapping data vs longitudinal/clustered data
Overlap as a multivariate-relationship
Formal setup
Notations
Overlap structure
Identifying overlap structure using aggregated information
From individual values to reported envelopes.
Potential of overlap
Interpretation of the potential of overlap.
Why this is a sensible proxy.
Basic properties.
Conservativeness and limitations.
...and 34 more sections

Key Result

Proposition 1

Denote by $D_{i,k} := \{ d_{I_{i,j},k} \mid j \in \{1, 2, \ldots, n_i\} \}$ the set of values of the $k$-th dimension of the observations in $S_i$. Then $D_{i_1,k} \cap D_{i_2,k} = \emptyset \Rightarrow S_{i_1} \cap S_{i_2} = \emptyset$.
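Proposition 1 can be illustrated with a small toy example. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names, the observation IDs, and the (country, year) characteristic vectors are all hypothetical. In practice the IDs are not available, which is the situation the proposition is designed for; they appear here only so that the conclusion can be checked directly.

```python
def characteristic_values(sample, k):
    """Return D_{i,k}: the set of values of dimension k observed in a sample.

    `sample` maps a (hypothetical) observation ID to its vector of
    intrinsic characteristics, e.g. (country, year).
    """
    return {vector[k] for vector in sample.values()}


def overlap_excluded(sample_a, sample_b, k):
    """True if D_{a,k} and D_{b,k} are disjoint, which by Proposition 1
    implies that the two samples share no observation."""
    return characteristic_values(sample_a, k).isdisjoint(
        characteristic_values(sample_b, k)
    )


# Hypothetical toy samples (assumed data, not from the paper).
S1 = {"p1": ("DE", 2015), "p2": ("DE", 2016)}
S2 = {"p3": ("FR", 2016), "p4": ("FR", 2017)}
S3 = {"p2": ("DE", 2016), "p5": ("DE", 2018)}

# Dimension 0 (country): the value sets of S1 and S2 are disjoint,
# so overlap is excluded, and indeed no ID is shared.
excluded_by_country = overlap_excluded(S1, S2, 0)
shared_ids_12 = set(S1) & set(S2)

# Dimension 1 (year): the value sets intersect, so this dimension alone
# cannot exclude overlap, even though S1 and S2 are in fact disjoint.
excluded_by_year = overlap_excluded(S1, S2, 1)

# S1 and S3 truly share observation "p2"; no dimension can exclude overlap.
excluded_13 = overlap_excluded(S1, S3, 0)
shared_ids_13 = set(S1) & set(S3)
```

The implication runs in one direction only: disjoint value sets on a single dimension rule out overlap, whereas intersecting value sets leave the question open.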

Figures (14)

Figure 1: Pairwise overlap can be insufficient to characterize multivariate overlap. Two configurations are shown in which the pairwise intersections $(A\cap B, A\cap C, B\cap C)$ are identical, but the three-way intersection differs: $A\cap B\cap C\neq\emptyset$ in the left configuration, whereas $A'\cap B\cap C=\emptyset$ in the right. This illustrates that overlap among multiple studies cannot, in general, be recovered from pairwise overlap information alone.
Figure 2: Study-combinations of $n=6$ studies. Columns correspond to subsets $A\subseteq\Omega$, rows correspond to studies $S_1,\ldots,S_6$, and a filled entry indicates membership $S_i\in A$. The highlighted block marks the $2^n-n-1$ non-empty combinations of size at least two for which overlap assessment is relevant.
Figure 3: Example overlap structure for four study samples. The Venn diagram visualizes which observations are shared across $S_1,\ldots,S_4$. Shading intensity reflects multiplicity under naive sample-size addition (darker regions correspond to observations that would be counted more times if overlaps were ignored). This example underlies Table \ref{tab:example_overlap_structure}, which reports $f(A)$ values for selected study-combinations.
Figure 4: Pairwise overlap-potential heat map for the four-study toy example. Cell $(i,j)$ shows $\tilde{\pi}'_{\mathcal{P}'}(\{S_i,S_j\})$, computed from study-level envelopes of the key characteristics (location and time) at the chosen partition resolution. Values equal to $0$ indicate overlap is excluded under Assumption \ref{ass:partition_compatibility}; larger values indicate greater overlap compatibility given the reported envelopes.
Figure 5: Grid plot of non-zero overlap potentials in the four-study toy example (Table \ref{tab:roo_prime}). Each column represents a study-combination $A\subseteq\Omega$; filled cells indicate which studies are included in $A$. Columns are ordered by decreasing $\tilde{\pi}'_{\mathcal{P}'}(A)$, and only combinations with $\tilde{\pi}'_{\mathcal{P}'}(A)>0$ are shown.
...and 9 more figures
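
The points made in the captions of Figures 1 and 2 can be checked with a few lines of Python. This is a minimal sketch with made-up sets, assuming that "identical pairwise intersections" in Figure 1 refers to the amount of pairwise overlap (here, the intersection sizes); the helper `study_combinations` and all set contents are hypothetical.

```python
from itertools import combinations


def study_combinations(studies):
    """All study-combinations of size at least two (cf. Figure 2)."""
    return [
        combo
        for size in range(2, len(studies) + 1)
        for combo in combinations(studies, size)
    ]


# Figure 2: with n = 6 studies there are 2**n - n - 1 such combinations.
studies = [f"S{i}" for i in range(1, 7)]
n_combinations = len(study_combinations(studies))

# Figure 1 (assumed toy sets): B and C are fixed, A and A_alt differ.
B = {1, 2}
C = {1, 3}
A = {1}
A_alt = {2, 3}


def pairwise_sizes(first, second, third):
    """Sizes of the three pairwise intersections."""
    return (len(first & second), len(first & third), len(second & third))


sizes_left = pairwise_sizes(A, B, C)
sizes_right = pairwise_sizes(A_alt, B, C)

# The pairwise overlap sizes agree, yet the three-way overlap differs.
triple_left = A & B & C
triple_right = A_alt & B & C
```

In this toy case both configurations have pairwise overlaps of size one, but only the first has a non-empty three-way intersection, so pairwise summaries alone would not distinguish them.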

Theorems & Definitions (19)

Definition 1: Intrinsic characteristic vector
Remark
Definition 2: Overlap set and proportion of overlap
Definition 3: Overlap structure
Proposition 1: Exclusion of pair-wise overlap by exclusion of the sets of one intrinsic characteristic
Remark
Remark
Proposition 2: Exclusion of overlap by exclusion of the ranges of intrinsic characteristic
Proposition 3: Excluding pair-wise overlap based on the range vector of intrinsic characteristics
Theorem 1: Exclusion of overlapping sample combination
...and 9 more
