Parsing Millions of URLs per Second

Yagiz Nizipli, Daniel Lemire

TL;DR

The paper addresses the challenge of building a fast URL parser that complies with the WHATWG URL standard and is suitable for integration into Node.js. It introduces Ada, a C++ parser that stores the normalized URL in a single buffer (url_aggregator) and uses SIMD-accelerated fast paths, a perfect hash of the protocol string, and compact component tracking to minimize allocations and copies. In the reported experiments, Ada outperforms RFC 3986 parsers, uses about three times fewer instructions than other WHATWG-compliant parsers such as Servo's rust-url, and uses up to eight times fewer instructions than curl. Node.js 20, which adopted Ada, parses realistic URLs four to five times faster than the last version with the legacy parser. The work also provides cross-language bindings and datasets, demonstrates practical impact on server-side web workloads, and points to further SIMD-based optimization and broader applicability as future directions.

Abstract

URLs are fundamental elements of web applications. By applying vector algorithms, we built a fast standard-compliant C++ implementation. Our parser uses three times fewer instructions than competing parsers following the WHATWG standard (e.g., Servo's rust-url) and up to eight times fewer instructions than the popular curl parser. The Node.js environment adopted our C++ library. In our tests on realistic data, a recent Node.js version (20.0) with our parser is four to five times faster than the last version with the legacy URL parser.

Paper Structure (9 sections, 1 equation, 10 figures, 5 tables)

This paper contains 9 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: URL Parser State Machine
  • Figure 2: Scan for tabulation and newline characters using SSE2 intrinsic functions. The ARM NEON version is similar.
  • Figure 3: Analysis of the protocol string
  • Figure 4: Function to detect forbidden or upper-case characters in host string
  • Figure 5: Path-signature function
  • ...and 5 more figures