Private DNA Sequencing: Hiding Information in Discrete Noise

Kayvon Mazooji, Roy Dong, Ilan Shomorony

TL;DR

We characterize upper and lower bounds on the solution of the problem of hiding a binary random variable $X$ (a genetic marker) in the additive noise created by mixing DNA samples, using mutual information as the privacy metric.

Abstract

When an individual's DNA is sequenced, sensitive medical information becomes available to the sequencing laboratory. A recently proposed way to hide an individual's genetic information is to mix in DNA samples of other individuals. We assume that the genetic content of these samples is known to the individual but unknown to the sequencing laboratory. Thus, these DNA samples act as "noise" to the sequencing laboratory, but still allow the individual to recover their own DNA samples afterward. Motivated by this idea, we study the problem of hiding a binary random variable $X$ (a genetic marker) with the additive noise provided by mixing DNA samples, using mutual information as a privacy metric. This is equivalent to the problem of finding a worst-case noise distribution for recovering $X$ from the noisy observation among a set of feasible discrete distributions. We characterize upper and lower bounds to the solution of this problem, which are empirically shown to be very close. The lower bound is obtained through a convex relaxation of the original discrete optimization problem, and yields a closed-form expression. The upper bound is computed via a greedy algorithm for selecting the mixing proportions.

Paper Structure

This paper contains 13 sections, 68 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: (a) To hide her genotype $X$ at a given locus $s$, Alice mixes her DNA sample with that of $K$ individuals in amounts $\alpha_1,\dots,\alpha_K$. Upon receiving the sequencing data from the lab, Alice can remove the contribution from the "noise individuals" (whose genotype at $s$ is known) to recover $X$. (b) Depending on the choice of the mixing coefficients $\alpha_0,\dots,\alpha_K$, very different channels are created between $X$ and the lab's observation $Y$. For $K=2$, the scheme $\alpha_0 = \alpha_1 = \alpha_2 = 1/3$ leads to the channel on the left, while the scheme $\alpha_0 = \alpha_1 = 1/4$, $\alpha_2 = 1/2$ leads to the channel on the right. Notice that even the output alphabet changes as a function of the mixing coefficients.
  • Figure 2: Optimal value of (\ref{eq:main2}) for integral $\alpha_i$'s compared to the uniform scheme and the binary scheme, for $K=5$ and $p \in [0,0.5]$. At $p = 0.5$, the optimal scheme is $\alpha = [1,1,2,4,8,16]$. At $p = 0.25$, the optimal scheme is $\alpha = [1,1,1,2,3,4]$. At $p = 0.01$, the optimal scheme is $\alpha = [1,1,1,1,1,1]$.
  • Figure 3: Comparison between the lower bound from (\ref{eq:mainthm}) and the upper bound provided by the greedy algorithm for $K = 15$.
  • Figure 4: Comparison of the pmf of $Z_\alpha$ produced by the greedy algorithm ($\alpha = [1, 1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 10, 12, 16, 19]$) and the (truncated) Geometric pmf in the lower bound (\ref{eq:mainthm}), for $K=15$ and $p=0.25$.
  • Figure 5: Comparison between the lower bound from (\ref{eq:mainthm}) and all upper bounds discussed in the paper for $K = 15$. Observe that the upper bound corresponding to the $K$-binary-linear-uniform scheme captures the general shape of the lower bound.
  • ...and 2 more figures
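
The privacy metric in the captions above can be evaluated directly for small $K$. The following is a minimal sketch, not the authors' code. It assumes a model that this page does not spell out: $X$ and the noise genotypes $Z_1,\dots,Z_K$ are i.i.d. Bernoulli($p$), and the lab observes $Y = \alpha_0 X + \sum_{i=1}^{K} \alpha_i Z_i$. The function names `output_pmf_given_x` and `mutual_information` are hypothetical. The sketch enumerates all $2^K$ noise patterns, so it is only practical for small $K$.

```python
from collections import defaultdict
from itertools import product
from math import log2


def output_pmf_given_x(alpha, p, x):
    """pmf of Y = alpha[0]*x + sum_i alpha[i]*Z_i given X = x.

    Assumption (not stated on this page): Z_i are i.i.d. Bernoulli(p).
    """
    pmf = defaultdict(float)
    # Enumerate every 0/1 pattern of the K noise genotypes.
    for z in product((0, 1), repeat=len(alpha) - 1):
        prob = 1.0
        y = alpha[0] * x
        for a, zi in zip(alpha[1:], z):
            prob *= p if zi else 1.0 - p
            y += a * zi
        pmf[y] += prob
    return pmf


def mutual_information(alpha, p):
    """I(X; Y) in bits, assuming X ~ Bernoulli(p)."""
    px = {0: 1.0 - p, 1: p}
    cond = {x: output_pmf_given_x(alpha, p, x) for x in (0, 1)}

    # Marginal pmf of the lab's observation Y.
    py = defaultdict(float)
    for x in (0, 1):
        for y, q in cond[x].items():
            py[y] += px[x] * q

    # I(X;Y) = sum_{x,y} P(x) P(y|x) log2( P(y|x) / P(y) ).
    mi = 0.0
    for x in (0, 1):
        for y, q in cond[x].items():
            if px[x] > 0 and q > 0:
                mi += px[x] * q * log2(q / py[y])
    return mi


if __name__ == "__main__":
    binary = [1, 1, 2, 4, 8, 16]   # scheme quoted in Figure 2 for p = 0.5
    uniform = [1, 1, 1, 1, 1, 1]   # uniform scheme for K = 5
    for name, alpha in (("binary", binary), ("uniform", uniform)):
        print(name, mutual_information(alpha, 0.5))
```

Integer coefficients are used so that output values can serve as exact dictionary keys. Under the assumed model, scaling all $\alpha_i$ by a common factor only relabels the outputs, so the Figure 1 schemes $(1/3, 1/3, 1/3)$ and $(1/4, 1/4, 1/2)$ can be passed as `[1, 1, 1]` and `[1, 1, 2]`.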