Two Point Correlation Function Estimation with Contaminated Data

Arya Farahi

Abstract

The two-point correlation function (2PCF) is a cornerstone of precision cosmology, yet its estimation from imaging surveys is vulnerable to contamination and incompleteness arising from imperfect target selection and pipeline-level inclusion decisions. In practice, the scientific target is a physically defined population, while the working catalog is constructed from noisy measurements and selection cuts, leading to mismatches between true and observed inclusion. These errors are often spatially structured, correlating with survey depth, observing conditions, and foregrounds, and can imprint spurious large-scale power or suppress the true clustering signal. High-resolution spectroscopic samples provide gold-standard inclusion in the target population but are typically available for only a small subset of objects. We introduce a prediction-powered Landy--Szalay (PP--LS) estimator that combines noisy inclusion labels across the full catalog with exact labels on a small spectroscopic subset while preserving the standard random-catalog normalization for survey geometry and selection. PP--LS debiases pair counts using residual-based, design-weighted corrections computed only on the labeled subset, requiring no probability calibration, known misclassification rates, or explicit modeling of contamination. Under simple random sampling of the labeled subset, we establish recovery of the oracle (true-label) Landy--Szalay pair counts and thus consistency for the target 2PCF. In simulations with clustered and spatially structured contaminants, PP--LS removes the bias of naive catalog-level estimators while achieving substantially lower variance than spectroscopic-only clustering. The resulting estimator is statistically principled, computationally lightweight, and integrates directly with standard pair-counting pipelines, enabling robust clustering inference in next-generation surveys.
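The residual-based, design-weighted correction described in the abstract can be illustrated with a minimal Python sketch. This is one natural construction consistent with the abstract, not necessarily the paper's exact formulas. The function names (`pair_counts`, `pp_landy_szalay`), the brute-force pair counting, and the inverse-inclusion weights $n/m$ for single objects and $n(n-1)/(m(m-1))$ for pairs under simple random sampling are all assumptions made for illustration.

```python
import numpy as np

def pair_counts(a, b, wa, wb, edges, auto=False):
    """Weighted pair counts per separation bin (brute force, small catalogs only)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    w = wa[:, None] * wb[None, :]
    if auto:
        # Auto-counts: keep each unordered pair i < j exactly once.
        iu = np.triu_indices(len(a), k=1)
        d, w = d[iu], w[iu]
    else:
        d, w = d.ravel(), w.ravel()
    counts, _ = np.histogram(d, bins=edges, weights=w)
    return counts

def pp_landy_szalay(pos, y_noisy, labeled_idx, y_true_labeled, rand, edges):
    """Sketch of a prediction-powered Landy-Szalay estimate (hypothetical helper).

    pos            : (n, 2) positions of all catalog objects
    y_noisy        : (n,) noisy 0/1 inclusion labels for the full catalog
    labeled_idx    : indices of the labeled subset (assumed simple random sample)
    y_true_labeled : exact 0/1 labels for the labeled subset
    rand           : (n_r, 2) random catalog tracing the survey geometry
    """
    n, m = len(pos), len(labeled_idx)
    yt = np.asarray(y_noisy, dtype=float)
    pl = pos[labeled_idx]
    yl_noisy = yt[labeled_idx]
    yl_true = np.asarray(y_true_labeled, dtype=float)
    ones_r = np.ones(len(rand))
    n_r = len(rand)

    # Inverse inclusion probabilities under simple random sampling (assumption).
    w1 = n / m
    w2 = n * (n - 1) / (m * (m - 1)) if m > 1 else 0.0

    # DD: noisy counts on the full catalog plus a weighted residual on labeled pairs.
    dd = pair_counts(pos, pos, yt, yt, edges, auto=True) + w2 * (
        pair_counts(pl, pl, yl_true, yl_true, edges, auto=True)
        - pair_counts(pl, pl, yl_noisy, yl_noisy, edges, auto=True)
    )
    # DR: noisy counts plus a weighted residual on labeled objects.
    dr = pair_counts(pos, rand, yt, ones_r, edges) + w1 * pair_counts(
        pl, rand, yl_true - yl_noisy, ones_r, edges
    )
    rr = pair_counts(rand, rand, ones_r, ones_r, edges, auto=True)

    # Corrected estimates of the number of targets and of target pairs.
    s_all, s_l, s_lt = yt.sum(), yl_true.sum(), yl_noisy.sum()
    n_d = s_all + w1 * (s_l - s_lt)
    n_pairs_d = s_all * (s_all - 1) / 2 + w2 * (
        s_l * (s_l - 1) / 2 - s_lt * (s_lt - 1) / 2
    )

    # Standard Landy-Szalay combination with random-catalog normalization.
    dd_n = dd / n_pairs_d
    dr_n = dr / (n_d * n_r)
    rr_n = rr / (n_r * (n_r - 1) / 2)
    return (dd_n - 2.0 * dr_n + rr_n) / rr_n
```

In this sketch, labeling the entire catalog ($m = n$) makes both weights equal to 1, so the corrected counts reduce exactly to the oracle (true-label) counts. For realistic catalog sizes, the brute-force distance matrix would be replaced by a tree-based or grid-based pair counter.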

Paper Structure (25 sections, 1 theorem, 85 equations, 5 figures, 1 table)


Key Result

Lemma 3.1

If $L$ is a simple random subset of size $m$, then

Figures (5)

  • Figure 1: Left: simulated true target sources drawn from a clustered Thomas process in a unit square. Middle: objects with $\widetilde{Y}=1$, mimicking a noisy source catalog with 30% contamination (false positives). Spatially varying contamination produces a hotspot and a gradient that bias pair counts if uncorrected. Right: estimated 2PCF $\xi(r)$ for the oracle (LS on $Y$), noisy (LS on $\widetilde{Y}$), and PP--LS estimators. The noisy LS estimate is biased; PP--LS aligns with the oracle by using residual corrections from a small spectroscopic subset.
  • Figure 2: Bias (left) and variance (right) of 2PCF estimators as a function of separation $r$, averaged over $N_R=2000$ simulated realizations. Bias is computed relative to the oracle estimator on sample $G$, which has access to the true labels. The naive LS estimator exhibits strong small-scale bias but low variance, while the cross-correlation and PP--LS estimators remain approximately unbiased but have higher variance. The LS estimator on the spectroscopic sample $L$ shows the largest variance because of its reduced sample size ($m/n \ll 1$). The oracle estimator provides the minimum-variance reference for unbiased estimators.
  • Figure 3: Variance ratios of the PP--LS estimator on $S$ and the LS estimator on $L$, each relative to the oracle estimator on $G$, as a function of the labeled sample size, shown for a fixed separation bin $r \in [0.01, 0.02]$. Each point is estimated from 800 independent realizations. The spectroscopic LS estimator exhibits extremely large variance at small labeled fractions, while the PP--LS estimator achieves substantially lower variance by leveraging the unlabeled sample. The dashed horizontal line indicates the oracle baseline variance (normalized to 1).
  • Figure 4: Variance ratio of the PP--LS estimator relative to the oracle estimator as a function of the classification error rate in the labeled (spectroscopic) sample, shown for a fixed separation bin $r \in [0.01, 0.02]$. The labeled fraction is fixed at $2\%$. Target sources are drawn from a clustered Thomas process and embedded in an inhomogeneous contaminant field. Each point is estimated from $800$ independent realizations. The lower dashed horizontal line indicates the oracle variance, while the upper dashed line shows the variance of the LS estimator on a purely spectroscopic sample $L$ with the same $2\%$ labeled fraction.
  • Figure 5: Spatial structure of contamination and classification errors used in the simulations. Left: contaminant intensity field defining the inhomogeneous Poisson process from which contaminant sources are drawn, including a smooth gradient and a localized overdensity. Middle: spatial density of false positives (contaminants misclassified as signal). Right: spatial density of false negatives (signal sources misclassified as contaminants). The spatial correlation between these fields introduces realistic, position-dependent classification errors.
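
The simulation setup described in the captions (clustered targets from a Thomas process, contaminants from an inhomogeneous Poisson process with a smooth gradient and a localized overdensity) can be sketched as follows. All parameter values, the function names, and the specific form of the intensity field are hypothetical placeholders chosen for illustration; the source does not state the values used. Parent points are drawn only inside the unit square, so clustering near the boundary is slightly underrepresented in this sketch.

```python
import numpy as np

def thomas_process(rng, kappa=30.0, mu=10.0, sigma=0.03):
    """Thomas cluster process on the unit square (illustrative parameters).

    kappa : mean number of parent points
    mu    : mean number of offspring per parent
    sigma : standard deviation of the Gaussian offspring displacement
    """
    n_parents = rng.poisson(kappa)
    parents = rng.uniform(0.0, 1.0, size=(n_parents, 2))
    n_children = rng.poisson(mu, size=n_parents)
    centers = np.repeat(parents, n_children, axis=0)
    pts = centers + rng.normal(0.0, sigma, size=centers.shape)
    # Discard offspring scattered outside the survey window.
    inside = np.all((pts >= 0.0) & (pts <= 1.0), axis=1)
    return pts[inside]

def contaminant_intensity(xy, base=100.0, grad=200.0, hot=600.0,
                          center=(0.7, 0.3), width=0.08):
    """Hypothetical intensity: constant + linear gradient in x + Gaussian hotspot."""
    x, y = xy[:, 0], xy[:, 1]
    r2 = (x - center[0]) ** 2 + (y - center[1]) ** 2
    return base + grad * x + hot * np.exp(-0.5 * r2 / width ** 2)

def inhomogeneous_poisson(rng, intensity, lam_max):
    """Inhomogeneous Poisson process on the unit square via thinning.

    lam_max must bound the intensity from above everywhere in the window.
    """
    n_cand = rng.poisson(lam_max)
    cand = rng.uniform(0.0, 1.0, size=(n_cand, 2))
    keep = rng.uniform(0.0, 1.0, size=n_cand) < intensity(cand) / lam_max
    return cand[keep]

rng = np.random.default_rng(42)
targets = thomas_process(rng)
# Upper bound on the intensity above: base + grad + hot = 900.
contaminants = inhomogeneous_poisson(rng, contaminant_intensity, lam_max=900.0)
```

Thinning draws candidates from a homogeneous process at the bounding rate and keeps each with probability proportional to the local intensity, which is why `lam_max` must not be smaller than the largest value of the field.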

Theorems & Definitions (2)

  • Lemma 3.1: Unbiasedness
  • proof