The German Tank Problem with Multiple Factories
Steven J. Miller, Kishan Sharma, Andrew K. Yang
TL;DR
This work extends the classical German Tank Problem (GTP) to a setting with $l$ factories, unknown gaps $G_i$ between their serial-number ranges, and total production $N_{\text{tot}}=\sum_i N_i$, under sampling without replacement. It develops the GTP-UM estimator for the case of an unknown minimum serial number, proves that it is the minimum-variance unbiased estimator (MVUE) with variance $\operatorname{Var}(\hat N_{\text{UM}})=\dfrac{2(N+1)(N-k)}{(k-1)(k+2)}$ for a sample of size $k$, and shows that the original GTP estimator is the MVUE in its own setting. For the multi-factory problem, it analyzes the probability of missing a factory, identifies an asymptotic threshold for when the samples cover all factories, and proposes a gap-informed approach that partitions the samples, applying GTP to the first factory and GTP-UM to the rest; simulations show improved accuracy given sufficient samples and favorable gap structures. When restricted to equal factory sizes and known fixed gaps, the paper derives a simple unbiased estimator $\hat N=\frac{1}{l}(2M - G(l-1) - 1)$ (and a useful large-$N$ approximation), yielding substantial variance reductions even with modest sample sizes. Overall, the work provides MVUE results for the GTP variants and practical strategies for estimating total production under multi-factory structures.
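The variance formula above can be checked exactly for small cases by enumerating every possible sample. The sketch below is only an illustration: it assumes the unknown-minimum estimator has the spread-based form $\hat N_{\text{UM}} = S\,\frac{k+1}{k-1} - 1$ with $S$ the difference between the largest and smallest observed serial numbers (the summary does not state the formula), and the function names and the test values `n`, `k`, `start` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def gtp_um_estimate(sample):
    """Spread-based estimate of a range's length when its minimum is unknown.

    Assumed form: S * (k + 1) / (k - 1) - 1, where S = max - min.
    """
    k = len(sample)
    spread = max(sample) - min(sample)
    return Fraction(spread * (k + 1), k - 1) - 1


def exact_mean_and_variance(n, k, start=1):
    """Enumerate every k-subset of {start, ..., start + n - 1} (sampling
    without replacement) and return the exact mean and variance of the
    estimate over all equally likely samples."""
    values = [gtp_um_estimate(c) for c in combinations(range(start, start + n), k)]
    mean = sum(values, Fraction(0)) / len(values)
    var = sum(((v - mean) ** 2 for v in values), Fraction(0)) / len(values)
    return mean, var


def stated_variance(n, k):
    """Variance quoted in the summary: 2(N+1)(N-k) / ((k-1)(k+2))."""
    return Fraction(2 * (n + 1) * (n - k), (k - 1) * (k + 2))


# Hypothetical small case: 12 serial numbers starting at 100, samples of size 4.
mean, var = exact_mean_and_variance(n=12, k=4, start=100)
print("mean:", mean, "variance:", var, "stated:", stated_variance(12, 4))
```

Exact rational arithmetic is used so the comparison is not blurred by rounding; the starting serial number should not affect the result, since the estimate depends only on the spread.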
Abstract
During the Second World War, estimates of the number of tanks deployed by Germany were critically needed. The Allies adopted a successful statistical approach to estimate this information: assume that the tanks are sequentially numbered starting from, say, 1, and ending at an unknown positive integer $N$. If we observe the serial numbers of $k$ tanks, then the best linear unbiased estimator for $N$ is $M(1+1/k)-1$, where $M$ is the maximum observed serial number. While this approach was successful, there are many more adversarial situations in which the approach to the original German Tank Problem falls short. Typically the number of ``factories'' is a possibly unknown $l>1$, and tanks produced by different factories may have serial numbers in disjoint ranges that are often separated by unknown amounts. Clark, Gonye and Miller (CGM) presented an unbiased estimator for $N$ when the minimum serial number is unknown. Thus, if one can identify which samples correspond to which factory, one can estimate each factory's range using CGM's method and sum the results to estimate the rival's total productivity. We present a procedure to estimate the total productivity and prove that it is effective when $\log l/\log k$ is sufficiently small. In the final section, we show that if we have a small number of samples, we can construct an estimator that performs orders of magnitude better when given additional information about the size of the gaps.
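As a rough illustration of the two ideas in the abstract, the following sketch computes the classical estimate $M(1+1/k)-1$ and then sums per-factory estimates for samples that have already been assigned to factories. It is not the paper's procedure: the grouping is taken as given, the unknown-minimum estimate is assumed to have the spread-based form $S(k+1)/(k-1)-1$ (the abstract does not state CGM's formula), and all function names and sample values are hypothetical.

```python
def gtp_estimate(sample):
    """Classical estimate M(1 + 1/k) - 1, assuming serial numbers start at 1."""
    k = len(sample)
    return max(sample) * (k + 1) / k - 1


def unknown_min_estimate(sample):
    """Assumed spread-based estimate of one factory's range length:
    S * (k + 1) / (k - 1) - 1, with S = max - min. Needs at least 2 samples."""
    k = len(sample)
    if k < 2:
        raise ValueError("need at least two samples from a factory")
    spread = max(sample) - min(sample)
    return spread * (k + 1) / (k - 1) - 1


def total_estimate(groups):
    """Sum the per-factory estimates; `groups` is a list of samples that are
    assumed to be already correctly assigned to their factories."""
    return sum(unknown_min_estimate(g) for g in groups)


# Hypothetical observations: three serials from one factory, two from another.
groups = [[3, 10, 7], [105, 120]]
print("classical estimate on the first group:", gtp_estimate(groups[0]))
print("estimated total production:", total_estimate(groups))
```

A factory with fewer than two observed serial numbers has no defined spread-based estimate, so the sketch raises an error rather than guessing.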
