SIEVE: Towards Verifiable Certification for Code-datasets
Fatou Ndiaye Mbodji, El-hacen Diallo, Jordan Samhi, Kui Liu, Jacques Klein, Tegawendé F. Bissyandé
TL;DR
SIEVE tackles the lack of verifiable quality guarantees for code-related datasets by introducing a per-property certification framework that yields machine-readable certificates with anytime-valid statistical bounds. It defines a governance model with sponsors, validators, and arbiters, anchored by a smart contract and off-chain evidence, so that audits are reproducible and verifiable without rescanning entire datasets. Central to the approach are Confidence Cards, which use anytime-valid confidence sequences to bound the true violation rate $p$ of each property with live intervals $[L_t,U_t]$ satisfying guarantees of the form $\text{Pr}(\forall t: p \in [L_t,U_t]) \ge 1-\delta$. The paper outlines a practical workflow for issuing, updating, and challenging attestations, and sketches future work to integrate SIEVE with IDEs, CI pipelines, and data catalogs to reduce duplicated cleaning and increase trust in code datasets.
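To make the guarantee above concrete, here is a minimal Python sketch of one simple way to obtain an anytime-valid interval for a violation rate. It is not the construction used by SIEVE, which this summary does not specify. It assumes audited items are drawn independently with a fixed violation probability $p$, and it combines Hoeffding's inequality with a union bound over time, giving step $t$ the error budget $\delta/(t(t+1))$ so that the budgets sum to $\delta$. The function name `confidence_sequence` and the simulated rate `true_p` are hypothetical.

```python
import math
import random


def confidence_sequence(observations, delta=0.05):
    """Yield (t, L_t, U_t) after each 0/1 observation (1 = property violated).

    Illustrative union-bound construction, not the paper's method:
    step t gets error budget delta / (t * (t + 1)), which sums to delta
    over all t >= 1, so all intervals hold simultaneously with
    probability at least 1 - delta under i.i.d. sampling.
    """
    violations = 0
    lower, upper = 0.0, 1.0
    for t, x in enumerate(observations, start=1):
        violations += x
        p_hat = violations / t
        delta_t = delta / (t * (t + 1))
        # Two-sided Hoeffding radius at confidence level 1 - delta_t.
        radius = math.sqrt(math.log(2.0 / delta_t) / (2.0 * t))
        # Running intersection: the reported interval never widens.
        lower = max(lower, p_hat - radius)
        upper = min(upper, p_hat + radius)
        yield t, lower, upper


if __name__ == "__main__":
    rng = random.Random(0)
    true_p = 0.1  # hypothetical violation rate, used only to simulate audits
    samples = [1 if rng.random() < true_p else 0 for _ in range(5000)]
    for t, lo, hi in confidence_sequence(samples):
        if t in (100, 1000, 5000):
            print(f"t={t}: [{lo:.3f}, {hi:.3f}]")
```

Because the bound holds for all $t$ at once, a validator could stop sampling, or resume later, at any point without invalidating the reported interval. Tighter confidence sequences exist; this version trades tightness for a short, easily checked derivation.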
Abstract
Code agents and empirical software engineering rely on public code datasets, yet these datasets lack verifiable quality guarantees. Static 'dataset cards' inform, but they are neither auditable nor backed by statistical guarantees, making it difficult to attest to dataset quality. As a result, teams build isolated, ad-hoc cleaning pipelines, which fragments effort and raises cost. We present SIEVE, a community-driven framework that turns per-property checks into Confidence Cards: machine-readable, verifiable certificates with anytime-valid statistical bounds. We outline a research plan to bring SIEVE to maturity, replacing narrative cards with anytime-verifiable certification. This shift is expected to lower quality-assurance costs and increase trust in code datasets.
