Parsing Gigabytes of JSON per Second

Geoff Langdale, Daniel Lemire

TL;DR

The paper tackles the bottleneck of parsing JSON at scale on commodity CPUs. It introduces simdjson, a standard-compliant JSON parser that employs a two-stage, SIMD-driven approach to rapidly identify structure and tokens, followed by tape-based construction and validation, enabling gigabytes-per-second throughput on a single core. Key contributions include a branchless, vectorized Stage 1 for structural detection and UTF-8 validation, a Stage 2 with a goto-based state machine to build a compact tape, and substantial empirical evidence showing major efficiency gains over traditional parsers like RapidJSON and sajson. The work has practical impact for high-throughput data ingestion in databases and analytics, and the authors provide open-source, reproducible software to foster adoption and further research, with clear avenues for future hardware-aware improvements (e.g., AVX-512) and cross-architecture validation.

Abstract

JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.

Paper Structure

This paper contains 24 sections, 9 figures, 13 tables.

Figures (9)

  • Figure 1: JSON example
  • Figure 2: JSON tape corresponding to the example in Fig. 1
  • Figure 3: Branchless code sequence to identify escaped quote characters (with example). We use the convention of the C language: '&' denotes the bitwise AND, '|' the bitwise OR, '<<' is a left shift, '~' is a bitwise negation.
  • Figure 4: Branchless code sequence to identify the quoted range, excluding the final quote character. CLMUL refers to the carry-less multiplication. We assume that OD is computed using the code sequence from Fig. 3.
  • Figure 5: Branchless code sequence to identify the structural and pseudo-structural characters. The quotes Q and the quoted range R are computed using the code sequence from Fig. 4. The structural (S) and white-space (W) characters are identified using vectorized classification.
  • ...and 4 more figures
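
The quoted-range idea in the Figure 4 caption can be illustrated with a small sketch. It is not the paper's code: it is a plain Python emulation under stated assumptions. The names `quote_bitmask`, `prefix_xor`, and `quoted_range` are hypothetical, escaped quotes are ignored (the caption assumes they were already removed using OD), and the carry-less multiplication by an all-ones word is replaced by an equivalent shift-and-XOR prefix computation on a 64-bit word.

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit machine word


def quote_bitmask(block: bytes) -> int:
    """Bit i is set when block[i] is a double quote.

    Assumption: escaped quotes are not handled in this sketch.
    Only the first 64 bytes are considered, mirroring a 64-byte block.
    """
    mask = 0
    for i, byte in enumerate(block[:64]):
        if byte == ord('"'):
            mask |= 1 << i
    return mask


def prefix_xor(bits: int) -> int:
    """Bit i of the result is the XOR of bits 0..i of the input.

    A carry-less multiplication by an all-ones word yields the same
    low 64 bits; here it is emulated with log2(64) = 6 shift/XOR steps.
    """
    bits &= MASK64
    shift = 1
    while shift < 64:
        bits ^= (bits << shift) & MASK64
        shift <<= 1
    return bits


def quoted_range(block: bytes) -> int:
    """Bitmask covering each opening quote and the string content,
    but not the closing quote."""
    return prefix_xor(quote_bitmask(block))


if __name__ == "__main__":
    doc = b'{"a":"b"}'
    mask = quoted_range(doc)
    # Print one marker per input byte: '1' inside a quoted range, '0' outside.
    print(doc.decode())
    print("".join("1" if (mask >> i) & 1 else "0" for i in range(len(doc))))
```

Each quote toggles the "inside a string" state, so a running XOR over the quote bits marks every position from an opening quote up to, but not including, the matching closing quote. A full implementation would also carry the state of an unterminated string from one 64-byte block to the next, which this sketch omits.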