Error-Tolerant E-Discovery Protocols

Jinshuo Dong, Jason D. Hartline, Liren Shan, Aravindan Vijayaraghavan

TL;DR

This paper tackles multi-party e-discovery with accountability and privacy constraints under non-realizable data, where a perfect linear separator may not exist. It introduces a Label-Verification protocol, integrated into the Continuous Active Learning (CAL) framework, that verifies the defendant's labels while limiting disclosure of non-responsive documents. The work provides per-call theoretical guarantees in the one-dimensional setting, including a recall bound of $1-(\mathrm{err}^*+k-1)/N^+$ and trade-offs in non-responsive disclosure (NRD), plus a lower bound of $\Omega(\log N)$ on the NRD needed for high recall. Empirical evaluation on TREC Matters 201 and 202 shows recall within about 10% of the baseline with NRD reductions of up to 75%, supporting the practical applicability of accountable e-discovery protocols.
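As a rough illustration of how the recall bound quoted above behaves, the sketch below evaluates $1-(\mathrm{err}^*+k-1)/N^+$ for a hypothetical instance. The function name `recall_lower_bound`, its parameter names, and the example numbers are assumptions made for illustration; they are not taken from the paper.

```python
def recall_lower_bound(err_star: int, k: int, n_pos: int) -> float:
    """Evaluate 1 - (err* + k - 1) / N^+, clipped below at 0.

    err_star: optimal error err^* of the instance (assumed non-negative)
    k:        error tolerance, k >= 1
    n_pos:    number of positive (responsive) points N^+
    """
    if n_pos <= 0:
        raise ValueError("n_pos must be positive")
    if k < 1:
        raise ValueError("k must be at least 1")
    if err_star < 0:
        raise ValueError("err_star must be non-negative")
    return max(0.0, 1.0 - (err_star + k - 1) / n_pos)


# Hypothetical instance: 200 responsive documents, optimal error 5, tolerance k = 3.
bound = recall_lower_bound(err_star=5, k=3, n_pos=200)
print(f"{bound:.3f}")  # 1 - 7/200 = 0.965
```

The bound degrades linearly in both $\mathrm{err}^*$ and $k$, so a larger error tolerance gives a weaker recall guarantee.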

Abstract

We consider the multi-party classification problem introduced by Dong, Hartline, and Vijayaraghavan (2022) in the context of electronic discovery (e-discovery). Based on a request for production from the requesting party, the responding party is required to provide documents that are responsive to the request except for those that are legally privileged. Our goal is to find a protocol that verifies that the responding party sends almost all responsive documents while minimizing the disclosure of non-responsive documents. We provide protocols in the challenging non-realizable setting, where the instance may not be perfectly separated by a linear classifier. We demonstrate empirically that our protocol successfully manages to find almost all relevant documents, while incurring only a small disclosure of non-responsive documents. We complement this with a theoretical analysis of our protocol in the single-dimensional setting, and other experiments on simulated data which suggest that the non-responsive disclosure incurred by our protocol may be unavoidable.
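The two quantities traded off in the abstract can be made concrete with a small sketch. It assumes that recall is the fraction of responsive documents that end up disclosed, and that non-responsive disclosure is the count of non-responsive documents disclosed; the paper's formal definitions may normalize these differently. The helper `recall_and_nrd` and the toy labels are hypothetical.

```python
def recall_and_nrd(labels, disclosed):
    """Return (recall, nrd) for a set of disclosed documents.

    labels:    dict mapping document id -> True if responsive, False otherwise
    disclosed: iterable of document ids sent to the requesting party
               (assumed to be a subset of the keys of `labels`)
    """
    disclosed = set(disclosed)
    total_responsive = sum(1 for is_resp in labels.values() if is_resp)
    found = sum(1 for doc in disclosed if labels[doc])
    nrd = sum(1 for doc in disclosed if not labels[doc])
    # With no responsive documents at all, recall is taken to be 1 by convention.
    recall = found / total_responsive if total_responsive else 1.0
    return recall, nrd


# Toy example: three responsive documents (1, 2, 5), two non-responsive (3, 4).
labels = {1: True, 2: True, 3: False, 4: False, 5: True}
recall, nrd = recall_and_nrd(labels, disclosed=[1, 2, 3])
print(recall, nrd)  # recall is 2/3, nrd is 1
```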

Paper Structure

This paper contains 21 sections, 12 theorems, 27 equations, 6 figures, 1 table, 6 algorithms.

Key Result

Theorem 4.1

Given a one-dimensional instance $(R,f)$, failure probability $\delta \in (0,1)$, and error tolerance $k \geq 1$, the Label-Verification Protocol for Label Report (Algorithm alg:one_dim_label) satisfies a recall bound of $1-(\mathrm{err}^*+k-1)/N^+$, as summarized in the TL;DR, together with a bound on non-responsive disclosure, where $N$ is the number of data points in $R$, $N^+$ and $N^-$ are the numbers of positive and negative points in $R$ respectively, and $\mathrm{err}^*$ is the optimal error.

Figures (6)

  • Figure 1: Comparison of the Continuous Active Learning (CAL) method of Cormack and Grossman (2014) under two internal labeling procedures. "Protocol_Classifier" uses the labeling protocol that we develop. "Reveal_All" is the naive baseline in which all documents that require hand labels are provided to the plaintiff. Both "Protocol_Classifier" and "Reveal_All" are accountable CAL-based protocols, meaning that they do not rely on the honesty of the defendant. We compare the recall and non-responsive disclosure of these two protocols. Note that the version of CAL that reveals only the responsive documents labeled by hand is not accountable. Experimental results are given for the dataset Matter 201. The left panel shows the test recall of the two protocols as the number of iterations $T=1,2,\ldots,30$ increases, with CAL requesting $N=1000$ new documents for review in each iteration. The right panel shows the non-responsive disclosure of the two protocols as the number of iterations increases. We repeat our protocol ten times and plot the average recall and non-responsive disclosure with the corresponding ranges.
  • Figure 2: E-discovery Protocol
  • Figure 3: Label-Verification Protocol
  • Figure 4: Test recall of Protocol_Label (CAL with the sampling protocol for label report), Protocol_Classifier (CAL with the sampling protocol for classifier report), and Reveal_All (CAL revealing all requested documents) on the datasets Matter 201 and 202. We plot the test recall of the three protocols over iterations $T=1,2,\ldots,30$ of the CAL method, which selects $N=1000$ new documents for review in each iteration. We repeat our protocols ten times and plot the average recall with the corresponding ranges.
  • Figure 5: Non-responsive disclosure of Protocol_Label (CAL with the sampling protocol for label report), Protocol_Classifier (CAL with the sampling protocol for classifier report), and Reveal_All (CAL revealing all requested documents) on the datasets Matter 201 and 202. We plot the non-responsive disclosure of the three protocols over iterations $T=1,2,\ldots,30$ of the CAL method, which selects $N=1000$ new documents for review in each iteration. We repeat our protocols ten times and plot the average non-responsive disclosure with the corresponding ranges.
  • ...and 1 more figure
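The per-iteration accounting behind the Reveal_All baseline in the captions above can be sketched as follows. This is a deliberately simplified stand-in, not the authors' experimental code: documents carry a fixed one-dimensional score and are reviewed in score order, so the classifier retraining step of real CAL is omitted. The function `simulate_reveal_all`, the synthetic data, and the batch size are all assumptions for illustration.

```python
import random


def simulate_reveal_all(docs, batch_size, iterations):
    """Track (recall, nrd) per iteration when every reviewed document is revealed.

    docs: list of (score, is_responsive) pairs. Documents are reviewed in
          decreasing score order, a fixed ranking standing in for CAL's
          retrained classifier (an assumption of this sketch).
    """
    total_pos = sum(1 for _, is_resp in docs if is_resp)
    ranked = sorted(docs, key=lambda d: d[0], reverse=True)
    history = []
    found = 0  # responsive documents revealed so far
    nrd = 0    # non-responsive documents revealed so far
    for t in range(iterations):
        batch = ranked[t * batch_size:(t + 1) * batch_size]
        for _, is_resp in batch:
            if is_resp:
                found += 1
            else:
                nrd += 1
        recall = found / total_pos if total_pos else 1.0
        history.append((recall, nrd))
    return history


# Synthetic, non-realizable data: a high score makes a document more likely,
# but not certain, to be responsive.
rng = random.Random(0)
docs = []
for _ in range(2000):
    score = rng.random()
    docs.append((score, rng.random() < score ** 3))

for t, (recall, nrd) in enumerate(simulate_reveal_all(docs, 100, 10), start=1):
    print(f"T={t:2d}  recall={recall:.3f}  nrd={nrd}")
```

Under Reveal_All, every non-responsive document that is reviewed counts toward disclosure, which is the cost a verification protocol aims to reduce.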

Theorems & Definitions (22)

  • Definition 2.1
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Proof of \ref{thm:lower-bound}
  • Lemma A.0
  • Proof of \ref{lem:one-dim-label}
  • Theorem A.1
  • Proof of Theorem \ref{thm:one-dim-label}
  • ...and 12 more