Parsing Gigabytes of JSON per Second
Geoff Langdale, Daniel Lemire
TL;DR
The paper addresses the cost of parsing JSON at scale on commodity CPUs. It introduces simdjson, a standard-compliant JSON parser that works in two stages. Stage 1 uses branchless, vectorized (SIMD) code to detect structural characters and validate UTF-8. Stage 2 uses a goto-based state machine to validate the document and build a compact tape representation. Together, the two stages reach gigabytes-per-second throughput on a single core. The empirical evaluation shows large efficiency gains over established parsers such as RapidJSON and sajson, with simdjson using a quarter or fewer of the instructions that RapidJSON needs. The work is relevant to high-throughput data ingestion in databases and analytics, and the authors release simdjson as open-source software so the results can be reproduced. Future directions include hardware-aware improvements (e.g., AVX-512) and validation on other architectures.
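To make the Stage 1 idea concrete, the following is a minimal scalar sketch in Python of how structural characters outside of strings can be located with bitmasks. It is an illustration under stated assumptions, not the simdjson implementation: the names `prefix_xor` and `structural_indexes` are hypothetical, escape handling is done with an ordinary loop rather than branchless bit tricks, and no SIMD instructions or UTF-8 validation are involved.

```python
STRUCTURAL = b"{}[]:,"  # characters treated as structural in this sketch


def prefix_xor(bits: int, width: int) -> int:
    """Return a mask whose bit i is the XOR of input bits 0..i."""
    shift = 1
    while shift < width:
        bits ^= bits << shift
        shift <<= 1
    return bits & ((1 << width) - 1)


def structural_indexes(doc: bytes) -> list[int]:
    """Positions of structural characters that lie outside string literals."""
    n = len(doc)
    quotes = 0        # bit i set: byte i is an unescaped double quote
    structurals = 0   # bit i set: byte i is a structural character
    escaped = False
    for i, b in enumerate(doc):
        if escaped:
            # The byte right after a backslash is never a quote or structural.
            escaped = False
            continue
        if b == ord("\\"):
            escaped = True
        elif b == ord('"'):
            quotes |= 1 << i
        elif b in STRUCTURAL:
            structurals |= 1 << i
    # The prefix XOR of the quote mask is 1 from each opening quote up to
    # (but not including) its closing quote, i.e. inside string literals.
    in_string = prefix_xor(quotes, n)
    structurals &= ~in_string
    return [i for i in range(n) if (structurals >> i) & 1]


if __name__ == "__main__":
    print(structural_indexes(b'{"a,b":[1,"x\\"}"]}'))
```

A vectorized parser would compute such masks for a whole block of input bytes at once. The sketch only shows why a quote mask plus a prefix XOR is enough to discard commas, colons, and brackets that appear inside strings.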
Abstract
JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.
