Deterministic Retrieval at Scale: Optimal-Space LCP Indexing and 308x Energy Reduction on Modern GPUs
Stanislav Byriukov
TL;DR
This work tackles deterministic top-$k$ retrieval under Longest Common Prefix (LCP) similarity for datasets of $N$ sequences of length $L$, proving a space lower bound and delivering a trie-based index that uses $O(NL)$ space with $O(L+k)$ query time. It introduces Thermal-Aware Logic (TAL), which transforms prefix structure into energy-efficient, range-bounded scans, achieving a $308\times$ reduction in energy per query and a $329\times$ improvement in p95 latency on a 20M-item benchmark on NVIDIA GPUs while sustaining near-peak utilization. The authors also establish a determinism guarantee for all executions and provide extensive hardware validation, showing that the LCP-Index both preserves exact results and operates efficiently enough for safety-critical, edge, and distributed systems. The combination of provable optimality, deterministic retrieval, and substantial energy savings has direct implications for certification-compliant memory retrieval in aerospace, automotive, and multi-agent robotics contexts. These results advance a practical deterministic primitive for scalable similarity retrieval where approximate methods are unacceptable, and the work emphasizes a path toward reproducible, certifiable high-performance retrieval.
Abstract
We study deterministic top-k retrieval under Longest Common Prefix (LCP) similarity for N sequences of length L. We prove a tight Omega(N) space lower bound (cell-probe model) and present a trie-based index using O(N*L) space with O(L+k) query time. We contrast this with pairwise materialization (Theta(N^2)), which hits a practical out-of-memory (OOM) wall at scale, while our indexed approach remains linear in N (O(N) memory for fixed L). We then introduce Thermal-Aware Logic (TAL), which turns prefix structure into range-bounded scans. In hardware measurements, TAL reduces energy per query by 308x (0.0145 J vs 4.46 J) and cuts p95 latency by 329x (0.114 ms vs 37.5 ms) on a 20M-item range-scan benchmark, while sustaining near-peak utilization (~99%) under long runs. The result is a deterministic retrieval primitive, backed by hardware measurements, for regimes where approximate methods are unacceptable.
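
To make the trie-plus-range idea concrete, here is a minimal CPU-side sketch in Python. It is not the authors' implementation: the class name `LCPIndex`, the method `top_k`, the dict-based node layout, and the tie-breaking rule (ascending sorted position) are all assumptions chosen for illustration. The sketch relies on one standard fact: once the sequences are sorted, every prefix corresponds to a contiguous range of positions, so each trie node only needs to store a half-open range `[lo, hi)`.

```python
from itertools import chain
from typing import List, Sequence, Tuple


class LCPIndex:
    """Hypothetical trie over sorted sequences.

    Each node stores the half-open range [lo, hi) of sorted positions
    whose sequences share that node's prefix.
    """

    def __init__(self, sequences: Sequence[str]) -> None:
        # Sorting makes every prefix map to one contiguous range.
        self.items: List[str] = sorted(sequences)
        self.root = {"children": {}, "lo": 0, "hi": len(self.items)}
        for pos, seq in enumerate(self.items):
            node = self.root
            for symbol in seq:
                child = node["children"].get(symbol)
                if child is None:
                    child = {"children": {}, "lo": pos, "hi": pos}
                    node["children"][symbol] = child
                child["hi"] = pos + 1  # items arrive in sorted order
                node = child

    def top_k(self, query: str, k: int) -> List[Tuple[int, str]]:
        """Return up to k (lcp_length, sequence) pairs, longest LCP first."""
        # Walk down the query, recording the range at each matched depth.
        path = [(0, self.root["lo"], self.root["hi"])]
        node = self.root
        for depth, symbol in enumerate(query, start=1):
            node = node["children"].get(symbol)
            if node is None:
                break
            path.append((depth, node["lo"], node["hi"]))

        results: List[Tuple[int, str]] = []
        # Range already emitted at deeper levels; starts empty.
        taken_lo = taken_hi = path[-1][1]
        for depth, lo, hi in reversed(path):
            # Scan only the positions not yet emitted; ties are broken
            # by ascending sorted position, so output is deterministic.
            for pos in chain(range(lo, taken_lo), range(taken_hi, hi)):
                if len(results) == k:
                    return results
                results.append((depth, self.items[pos]))
            taken_lo, taken_hi = lo, hi
        return results


index = LCPIndex(["abxx", "bbbb", "abcd", "azzz", "abce"])
print(index.top_k("abcf", 3))
```

In this sketch the descent costs at most one step per query symbol and the emission touches each returned item once, which matches the $O(L+k)$ shape stated above; the trie holds at most one node per input symbol, in line with the $O(NL)$ space bound. Each scan is confined to a precomputed range, which is the sense in which a prefix structure can be turned into range-bounded scans. How TAL maps such ranges onto GPU hardware is not covered by this example.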
