Parsing Millions of URLs per Second

Yagiz Nizipli; Daniel Lemire

TL;DR

The paper addresses the problem of building a fast URL parser that complies with the WHATWG URL standard and is suitable for integration into Node.js. It introduces Ada, a C++ parser that stores the normalized URL in a single buffer (url_aggregator), uses SIMD-accelerated fast paths and a perfect hash of the protocol string, and tracks component boundaries compactly to minimize allocations and copies. In the reported benchmarks, Ada uses about three times fewer instructions than other WHATWG-compliant parsers such as Servo's rust-url and up to eight times fewer instructions than curl; it also outperforms the RFC 3986 parsers it is compared against. Node.js 20, which ships Ada, is four to five times faster on realistic data than the last Node.js version with the legacy URL parser. The work also provides cross-language bindings and datasets, reduces a URL parsing bottleneck in server-side web workloads, and points to further SIMD-based optimization and broader applicability as future work.
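The single-buffer idea in the summary above can be illustrated with a small sketch. This is a simplified illustration, not Ada's actual url_aggregator: the names `UrlSketch` and `index_normalized` are hypothetical, and the toy indexer assumes an already-normalized `scheme://host/path?query#fragment` string without credentials. The point it shows is that the URL is stored once and each component is a pair of offsets into that buffer, so reading a component needs no allocation or copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical, simplified layout (an assumption for illustration only).
struct UrlSketch {
  std::string buffer;              // the whole normalized URL, stored once
  std::uint32_t protocol_end = 0;  // index just past the ':' of the scheme
  std::uint32_t host_start = 0;    // index of the first host character
  std::uint32_t host_end = 0;      // index just past the host
  std::uint32_t path_end = 0;      // index just past the path

  // Return a non-owning view of buffer[begin, end).
  std::string_view view(std::uint32_t begin, std::uint32_t end) const {
    assert(begin <= end && end <= buffer.size());
    return std::string_view(buffer).substr(begin, end - begin);
  }
  std::string_view protocol() const { return view(0, protocol_end); }
  std::string_view host() const { return view(host_start, host_end); }
  std::string_view pathname() const { return view(host_end, path_end); }
};

// Toy indexer: records offsets only, performs no normalization or validation
// beyond requiring a "://" separator.
inline std::optional<UrlSketch> index_normalized(std::string url) {
  const std::size_t sep = url.find("://");
  if (sep == std::string::npos) return std::nullopt;

  UrlSketch u;
  u.protocol_end = static_cast<std::uint32_t>(sep + 1);
  u.host_start = u.protocol_end + 2;  // skip the "//"

  std::size_t host_end = url.find_first_of("/?#", u.host_start);
  if (host_end == std::string::npos) host_end = url.size();
  std::size_t path_end = url.find_first_of("?#", host_end);
  if (path_end == std::string::npos) path_end = url.size();

  u.host_end = static_cast<std::uint32_t>(host_end);
  u.path_end = static_cast<std::uint32_t>(path_end);
  u.buffer = std::move(url);
  return u;
}
```

For example, `index_normalized("https://example.org/a/b?x=1#frag")` yields a value whose `protocol()`, `host()`, and `pathname()` are views into the same buffer. The views are computed on demand from the offsets, so they remain valid for as long as the `UrlSketch` object itself is alive and unmodified.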

Abstract

URLs are fundamental elements of web applications. By applying vector algorithms, we built a fast standard-compliant C++ implementation. Our parser uses three times fewer instructions than competing parsers following the WHATWG standard (e.g., Servo's rust-url) and up to eight times fewer instructions than the popular curl parser. The Node.js environment adopted our C++ library. In our tests on realistic data, a recent Node.js version (20.0) with our parser is four to five times faster than the last version with the legacy URL parser.

Paper Structure

This paper contains 9 sections, 1 equation, 10 figures, 5 tables.

Introduction
Related Work
Parsing URL Strings
Fast Parsing
JavaScript Integration
Benchmarks
JavaScript runtime environments
JavaScript server
Conclusion

Figures (10)

Figure 1: URL Parser State Machine
Figure 2: Scan for tabulation and newline characters using SSE2 intrinsic functions. The ARM NEON version is similar.
Figure 3: Analysis of the protocol string
Figure 4: Function to detect forbidden or upper-case characters in host string
Figure 5: Path-signature function
...and 5 more figures
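
The caption of Figure 2 describes scanning the input for tabulation and newline characters with SSE2 intrinsics. The following is a minimal sketch of that kind of scan, not the paper's listing: the function name `has_tab_or_newline` is hypothetical, and the sketch assumes the characters of interest are tab, line feed, and carriage return (the set the WHATWG URL standard calls "ASCII tab or newline"). It compares 16 bytes at a time, accumulates the matches, and falls back to a scalar loop for the tail or when SSE2 is unavailable.

```cpp
#include <cstddef>
#include <string_view>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Returns true if the input contains '\t', '\n', or '\r'.
inline bool has_tab_or_newline(std::string_view input) {
  std::size_t i = 0;
#if SKETCH_HAS_SSE2
  const __m128i tab = _mm_set1_epi8('\t');
  const __m128i lf = _mm_set1_epi8('\n');
  const __m128i cr = _mm_set1_epi8('\r');
  __m128i found = _mm_setzero_si128();
  // Process full 16-byte blocks; OR the per-byte comparison results together
  // so that the loop body has no branch on the data.
  for (; i + 16 <= input.size(); i += 16) {
    const __m128i block = _mm_loadu_si128(
        reinterpret_cast<const __m128i*>(input.data() + i));
    const __m128i hit = _mm_or_si128(
        _mm_cmpeq_epi8(block, tab),
        _mm_or_si128(_mm_cmpeq_epi8(block, lf), _mm_cmpeq_epi8(block, cr)));
    found = _mm_or_si128(found, hit);
  }
  // Any set bit means at least one byte matched in some block.
  if (_mm_movemask_epi8(found) != 0) return true;
#endif
  // Scalar tail (or the whole input when SSE2 is not available).
  for (; i < input.size(); ++i) {
    const char c = input[i];
    if (c == '\t' || c == '\n' || c == '\r') return true;
  }
  return false;
}
```

A parser can use such a check as a fast path: most URLs contain none of these characters, so the common case costs one pass over the input and skips the slower code that strips them.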
