Seamless, Correct, and Generic Programming over Serialised Data
Guillaume Allais
TL;DR
The paper tackles the problem of safely and efficiently manipulating data stored in serialised buffers without fully deserialising it, by building a universe of serialised datatype descriptions and leveraging Quantitative Type Theory (QTT) as implemented in Idris 2. It develops a semantic foundation in which descriptions are interpreted as endofunctors and datatypes as their initial algebras, and introduces a serialisation format that stores offsets so that subterms can be accessed at random. It then defines typed pointers, views, and a trusted core of IO primitives for operating on buffers, together with generic folds and serialisation combinators that are correct by construction. Benchmarks show substantial speedups for operations that avoid deserialisation. The approach lays the groundwork for generic, correct-by-construction processing of serialised data in systems where working directly on the buffer is advantageous.
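To illustrate why stored offsets enable random access, here is a minimal sketch in Python rather than Idris 2. The byte layout (a one-byte tag, then an 8-byte little-endian size for the left subtree, then the left subtree, the node's byte, and the right subtree) and the names `serialise` and `rightmost` are assumptions made for this example, not necessarily the paper's exact format or API. The sketch also has none of the type-level guarantees the paper provides; it only shows the access pattern.

```python
import struct

LEAF, NODE = 0, 1  # hypothetical constructor tags


def serialise(tree):
    """Serialise a tree given as None (leaf) or (left, byte, right)."""
    if tree is None:
        return bytes([LEAF])
    left, value, right = tree
    left_bytes = serialise(left)
    right_bytes = serialise(right)
    # Layout: tag | size of left subtree (8 bytes) | left | value | right
    return (bytes([NODE]) + struct.pack("<Q", len(left_bytes))
            + left_bytes + bytes([value]) + right_bytes)


def rightmost(buf, pos=0):
    """Return the byte at the rightmost node (None for a leaf).

    Works directly on the buffer: each left subtree is skipped in
    constant time by reading its stored size instead of traversing it.
    """
    result = None
    while buf[pos] == NODE:
        (left_size,) = struct.unpack_from("<Q", buf, pos + 1)
        value_pos = pos + 1 + 8 + left_size  # jump over the left subtree
        result = buf[value_pos]
        pos = value_pos + 1  # the right subtree starts right after the value
    return result
```

For instance, calling `rightmost(serialise(((None, 1, None), 5, (None, 9, None))))` reads only the tags, sizes, and values along the right spine of the buffer and never inspects the contents of the left subtree.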
Abstract
In typed functional languages, one can typically only manipulate data in a type-safe manner if it has first been deserialised into an in-memory tree represented as a graph of nodes-as-structs and subterms-as-pointers. We demonstrate how we can use QTT as implemented in Idris 2 to define a small universe of serialised datatypes, and provide generic programs allowing users to process values stored contiguously in buffers. Our approach allows implementors to prove, by construction, the full functional correctness of the IO functions processing the data stored in the buffer.
