A Goemans-Williamson type algorithm for identifying subcohorts in clinical trials

Pratik Worah

TL;DR

This work addresses identifying homogeneous patient subcohorts by designing sparse, interpretable tests. It casts subcohort discovery as a MAX-CUT–like optimization and solves it via a semidefinite programming (SDP) relaxation with a Goemans–Williamson–style rounding, achieving an approximation factor of roughly $0.82$. The method is empirically applied to the METABRIC breast cancer dataset, revealing subcohorts with meaningful metastatic enrichment and uncovering associations between methylation changes and nuclear receptor expression, including a subcohort suggesting LXRB as a therapeutic target. The results demonstrate favorable sensitivity–specificity–sparsity trade-offs compared with PRIM and highlight potential clinical pathways for targeted interventions in breast cancer, while noting the need for clinical validation and consideration of limitations.

Abstract

We design an efficient algorithm that outputs tests for identifying predominantly homogeneous subcohorts of patients from large inhomogeneous datasets. Our theoretical contribution is a rounding technique, similar to that of Goemans and Williamson (1995), that approximates the optimal solution within a factor of $0.82$. As an application, we use our algorithm to trade off sensitivity for specificity and systematically identify clinically interesting homogeneous subcohorts of patients in the RNA microarray dataset for breast cancer from Curtis et al. (2012). One such subcohort suggests a link between LXR over-expression and BRCA2 and MSH6 methylation levels for the patients it contains.
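To make the rounding step concrete, the following is a minimal sketch of the classic Goemans-Williamson random-hyperplane rounding for MAX-CUT, not the paper's variant (whose factor is $0.82$, whereas the classic MAX-CUT guarantee is about $0.878$). It assumes the SDP has already been solved and its solution is available as one unit vector per vertex. The names `hyperplane_round` and `cut_value`, and the toy 4-cycle with a hand-picked embedding, are illustrative assumptions.

```python
import numpy as np

def hyperplane_round(vectors, rng):
    """Assign each vertex to a side of a random hyperplane through the origin.

    vectors: (n, d) array, one unit vector per vertex (assumed SDP solution).
    Returns an array of +1/-1 labels.
    """
    r = rng.standard_normal(vectors.shape[1])  # random normal direction
    return np.where(vectors @ r >= 0, 1, -1)

def cut_value(weights, signs):
    """Total weight of edges whose endpoints received different labels."""
    different = signs[:, None] != signs[None, :]
    # Count each undirected edge once via the strict upper triangle.
    return float(np.triu(weights * different, k=1).sum())

# Toy instance (hypothetical): a 4-cycle with unit edge weights.
weights = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    weights[i, j] = weights[j, i] = 1.0

# Hand-picked embedding: adjacent vertices get antipodal vectors, so any
# hyperplane not containing them separates every edge.
vectors = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])

rng = np.random.default_rng(0)
signs = hyperplane_round(vectors, rng)
best_cut = cut_value(weights, signs)
print(signs, best_cut)
```

In practice the rounding is usually repeated over several random hyperplanes and the best cut is kept; the sketch draws a single hyperplane to stay short.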

Paper Structure

This paper contains 16 sections, 4 theorems, 19 equations, 5 figures, 2 algorithms.

Key Result

Theorem 3.2

Let $S^*=(S^*_U,S^*_V)$ denote the optimal subcohort, with $|S_U^*|\ll |U|$. Then Algorithm algmain computes a subcohort $S=(S_U,S_V)$ such that $|S_U|\ge \alpha|S^*_U|$, where the constant $\alpha\simeq 0.82$. Equivalently, $\frac{|S_U|}{|U|+|V|}\ge \alpha\cdot\frac{|S^*_U|}{|U|+|V|}$, i.e., t

Figures (5)

  • Figure 1: Sensitivity vs. specificity trade-off using Algorithm algmain on the METABRIC dataset. Left: Algorithm algmain outputs a single test hyperplane that identifies subcohorts with 57% specificity, for sparsity parameter $s=6$. Right: When we use the intersection of subcohorts from two tests, the specificity increases to about 70%, but the identified subcohort size (sensitivity) decreases to a third, with $s=6$. Note: the single test in the left panel uses about 15 nuclear receptors to identify subcohorts, while the two tests in the right panel use about 45 nuclear receptors.
  • Figure 2: Gene expression and methylation percentages for metastasis and non-metastasis patients, computed over the full dataset. Averaging over the entire dataset, which consists of a non-homogeneous patient population, leads to large confidence intervals. These intervals make it impossible to distinguish between metastasis and non-metastasis using z-scores based on the nuclear receptors and breast cancer related genes used above.
  • Figure 3: Associating changes in methylation percentages with changes in nuclear receptors using Algorithm algmain. Compared to averaging over the entire population (see Figure 2), averaging over patients in a subcohort can lead to statistically significant differences between metastasis and non-metastasis breast cancers. In particular, in the above subcohort of patients, BRCA2 and MSH6 have significantly higher methylation percentages than in the entire cohort of patients in the METABRIC dataset; at the same time, these patients have a significantly higher expression of NR1H2 receptors, which are responsible for lipid metabolism.
  • Figure 4: Lower bound on specificity as a function of $\kappa$ and $\frac{|V|}{|U|}$, based on Theorem 3.3.
  • Figure 5: Specificity vs. $k$ (sparsity), with sensitivity held above a lower bound, for the variant of the PRIM algorithm used for comparison. The dashed red line (in both panels) denotes the specificity of the MAX-CUT algorithm, from Figure 1, for the given sensitivities ($0.3$ and $0.1$), when the number of genes used is 15 (left panel) and 45 (right panel). In both cases, the PRIM local search heuristic fails to approach the specificity of Algorithm algmain.
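The trade-off in the Figure 1 caption, where intersecting two hyperplane tests raises specificity but shrinks the subcohort, can be illustrated on toy data. This is only a sketch under stated assumptions: the six-patient dataset is invented, and the definitions used here (sensitivity as $|S_U|/|U|$, and a purity-style score $|S_U|/(|S_U|+|S_V|)$ standing in for specificity) are assumptions, since this summary does not spell out the paper's exact definitions. The helper names `apply_test` and `subcohort_stats` are hypothetical.

```python
import numpy as np

def apply_test(X, w, b):
    """Boolean mask of patients on the positive side of the hyperplane w.x >= b."""
    return X @ w >= b

def subcohort_stats(mask, is_U):
    """Return (sensitivity, purity) under the assumed definitions.

    sensitivity = |S_U| / |U|, purity = |S_U| / (|S_U| + |S_V|).
    """
    s_u = int(np.sum(mask & is_U))
    s_v = int(np.sum(mask & ~is_U))
    sensitivity = s_u / int(is_U.sum())
    purity = s_u / (s_u + s_v) if (s_u + s_v) else float("nan")
    return sensitivity, purity

# Invented toy data: 6 patients, 2 features; the first three belong to U.
X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 2.0],
              [2.0, 0.0], [0.0, 0.0], [0.0, 2.0]])
is_U = np.array([True, True, True, False, False, False])

# Two sparse tests, each using a single feature.
test1 = apply_test(X, np.array([1.0, 0.0]), 1.0)
test2 = apply_test(X, np.array([0.0, 1.0]), 1.0)

sens_single, purity_single = subcohort_stats(test1, is_U)
sens_both, purity_both = subcohort_stats(test1 & test2, is_U)
print(sens_single, purity_single)  # one test alone
print(sens_both, purity_both)      # intersection of both tests
```

On this toy data the intersection keeps only patients passing both tests, so the subcohort becomes purer while covering a smaller share of $U$, mirroring the qualitative pattern the caption describes.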

Theorems & Definitions (5)

  • Remark 3.1
  • Theorem 3.2: Sensitivity
  • Theorem 3.3: Specificity
  • Theorem 3.4
  • Lemma 8.1