Identifying the potential of sample overlap in evidence synthesis of observational studies

Zhentian Zhang, Tim Friede, Tim Mathes

TL;DR

This work develops a method for constructing indicators of the degree of sample overlap in evidence syntheses of studies based on existing data. The method is rooted in set theory and relies on coding the ranges of several well-selected sample characteristics.

Abstract

Sample overlap is a common issue in evidence synthesis in medical research, particularly when integrating findings from observational studies that use existing databases such as registries. Because unique identifiers for each observation are generally inaccessible, addressing sample overlap has been a complex problem, one that can bias evidence synthesis outcomes and undermine their credibility. We developed a method to construct indicators for the degree of sample overlap in evidence synthesis of studies based on existing data. Our method is rooted in set theory and is based on coding the ranges of several well-selected sample characteristics. It offers a practical solution by making inferences from sample characteristics rather than from individual participant data. Useful information, such as the overlap-free sample set with the largest sample size in an evidence synthesis, can be derived from this method. We applied our model to several real-world evidence syntheses, demonstrating its effectiveness and flexibility. Our findings highlight the growing importance of addressing sample overlap in evidence synthesis, especially given the increasing relevance of secondary use of data, an area currently under-explored in research.

Paper Structure (49 sections, 5 theorems, 30 equations, 14 figures, 6 tables)


Key Result

Proposition 1

Denote by $D_{i,k}:=\{d_{I_{i,j},k}\mid j\in \{1, 2,\ldots, n_i\}\}$ the set of values of the $k$-th dimension of the observations in $S_i$. Then $D_{i_1,k} \cap D_{i_2,k} = \emptyset \Rightarrow S_{i_1} \cap S_{i_2} = \emptyset$.
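The logic of Proposition 1 can be illustrated with a minimal Python sketch. It assumes a hypothetical toy population in which each observation carries a fixed vector of intrinsic characteristics (here birth year and region); the data, the names `POPULATION`, `values_on_dimension`, and `overlap_excluded`, and the chosen characteristics are illustrative assumptions, not taken from the paper. The sketch also shows that the converse does not hold: shared values on one dimension do not imply shared observations.

```python
from typing import Dict, Hashable, Set, Tuple

# Hypothetical toy data: observation id -> intrinsic characteristic vector.
# Dimension 0 is birth year, dimension 1 is region (illustrative only).
POPULATION: Dict[int, Tuple[int, str]] = {
    1: (1950, "north"),
    2: (1962, "north"),
    3: (1975, "south"),
    4: (1981, "south"),
    5: (1990, "north"),
}


def values_on_dimension(sample: Set[int], k: int) -> Set[Hashable]:
    """Return D_{i,k}: the values of the k-th characteristic in a sample."""
    return {POPULATION[obs][k] for obs in sample}


def overlap_excluded(sample_a: Set[int], sample_b: Set[int], k: int) -> bool:
    """True if the value sets on dimension k are disjoint.

    By Proposition 1, this is sufficient (not necessary) for the
    two samples to share no observation.
    """
    return values_on_dimension(sample_a, k).isdisjoint(
        values_on_dimension(sample_b, k)
    )


S1 = {1, 2}
S2 = {3, 4}
S3 = {2, 5}

# Disjoint birth years, so the samples cannot overlap.
print(overlap_excluded(S1, S2, 0), S1 & S2)
# Shared birth year 1962, so overlap cannot be excluded on this dimension.
print(overlap_excluded(S1, S3, 0), S1 & S3)
# The converse fails: same region, yet no shared observation.
print(overlap_excluded({1}, {5}, 1), {1} & {5})
```

In practice the observation identifiers are exactly what is unavailable, so only the value sets (or their reported ranges) would be compared; the identifiers appear here only to check the implication.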

Figures (14)

  • Figure 1: Pairwise overlap can be insufficient to characterize multivariate overlap. Two configurations are shown in which the pairwise intersections $(A\cap B, A\cap C, B\cap C)$ are identical, but the three-way intersection differs: $A\cap B\cap C\neq\emptyset$ in the left configuration, whereas $A'\cap B\cap C=\emptyset$ in the right. This illustrates that overlap among multiple studies cannot, in general, be recovered from pairwise overlap information alone.
  • Figure 2: Study-combinations of $n=6$ studies. Columns correspond to subsets $A\subseteq\Omega$, rows correspond to studies $S_1,\ldots,S_6$, and a filled entry indicates membership $S_i\in A$. The highlighted block marks the $2^n-n-1$ non-empty combinations of size at least two for which overlap assessment is relevant.
  • Figure 3: Example overlap structure for four study samples. The Venn diagram visualizes which observations are shared across $S_1,\ldots,S_4$. Shading intensity reflects multiplicity under naive sample-size addition (darker regions correspond to observations that would be counted more times if overlaps were ignored). This example underlies Table \ref{tab:example_overlap_structure}, which reports $f(A)$ values for selected study-combinations.
  • Figure 4: Pairwise overlap-potential heat map for the four-study toy example. Cell $(i,j)$ shows $\tilde{\pi}'_{\mathcal{P}'}(\{S_i,S_j\})$, computed from study-level envelopes of the key characteristics (location and time) at the chosen partition resolution. Values equal to $0$ indicate overlap is excluded under Assumption \ref{ass:partition_compatibility}; larger values indicate greater overlap compatibility given the reported envelopes.
  • Figure 5: Grid plot of non-zero overlap potentials in the four-study toy example (Table \ref{tab:roo_prime}). Each column represents a study-combination $A\subseteq\Omega$; filled cells indicate which studies are included in $A$. Columns are ordered by decreasing $\tilde{\pi}'_{\mathcal{P}'}(A)$, and only combinations with $\tilde{\pi}'_{\mathcal{P}'}(A)>0$ are shown.
  • ...and 9 more figures
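
Two points from the captions of Figures 1 and 2 can be checked with a short Python sketch. The first part enumerates the study-combinations of size at least two for $n=6$ and compares the count with $2^n-n-1$. The second part builds a small counterexample in the spirit of Figure 1, under the assumption that "identical" pairwise intersections refers to their sizes; the sets `A`, `A_prime`, `B`, and `C` and the helper names are hypothetical and not taken from the paper.

```python
from itertools import combinations
from typing import List, Set, Tuple


def relevant_combinations(studies: List[str]) -> List[Tuple[str, ...]]:
    """All study-combinations of size at least two."""
    return [
        combo
        for size in range(2, len(studies) + 1)
        for combo in combinations(studies, size)
    ]


studies = [f"S{i}" for i in range(1, 7)]
n = len(studies)
combos = relevant_combinations(studies)
print(len(combos), 2**n - n - 1)  # prints "57 57"


def pairwise_sizes(x: Set[int], y: Set[int], z: Set[int]) -> Tuple[int, int, int]:
    """Sizes of the three pairwise intersections."""
    return (len(x & y), len(x & z), len(y & z))


# Hypothetical samples: B and C are fixed, A and A_prime are alternatives.
B = {1, 5}
C = {1, 6}
A = {1, 4}        # shares observation 1 with both B and C
A_prime = {5, 6}  # shares 5 with B and 6 with C, but not 1

# Same pairwise overlap sizes in both configurations...
print(pairwise_sizes(A, B, C), pairwise_sizes(A_prime, B, C))
# ...but different three-way intersections.
print(A & B & C, A_prime & B & C)
```

The count grows exponentially in the number of studies, which is one reason why ruling out combinations early (for example via pairwise exclusion) is useful before assessing higher-order overlap.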

Theorems & Definitions (19)

  • Definition 1: Intrinsic characteristic vector
  • Remark
  • Definition 2: Overlap set and proportion of overlap
  • Definition 3: Overlap structure
  • Proposition 1: Exclusion of pair-wise overlap by exclusion of the sets of one intrinsic characteristic
  • Remark
  • Remark
  • Proposition 2: Exclusion of overlap by exclusion of the ranges of intrinsic characteristic
  • Proposition 3: Excluding pair-wise overlap based on the range vector of intrinsic characteristics
  • Theorem 1: Exclusion of overlapping sample combination
  • ...and 9 more