OpenZL: A Graph-Based Model for Compression

Yann Collet, Nick Terrell, W. Felix Handte, Danielle Rozenblit, Victor Zhang, Kevin Zhang, Yaelle Goldschlag, Jennifer Lee, Elliot Gorokhovsky, Yonatan Komornik, Daniel Riegel, Stan Angelov, Nadav Rotem

TL;DR

OpenZL introduces a graph-based model of compression that represents compression as a directed acyclic graph of modular codecs, enabling a universal decoder and a self-describing wire format. This framework allows domain experts to build specialized compressors with minimal code while maintaining broad deployability, security, and maintainability, addressing the common drawbacks of traditional application-specific solutions. Empirical results across diverse datasets show trained OpenZL compressors achieving superior compression ratios and competitive speeds compared with state-of-the-art general-purpose compressors, with Meta reporting meaningful production gains and faster development cycles. The work also provides tooling for parsing, training, and deployment (SDDL, ACE, training orchestration) and outlines a pathway toward ML-guided compressor generation and expanded typed-data LZ approaches.

Abstract

Research techniques in the last decade have improved lossless compression ratios by significantly increasing processing time. These techniques have remained obscure because production systems require high throughput and low resource utilization. In practice, application-specific compression algorithms that leverage knowledge of the data structure and semantics are more popular. Application-specific compressor systems outperform even the best generic compressors, but these techniques have some drawbacks. Application-specific compressors are inherently limited in applicability, have high development costs, and are difficult to maintain and deploy. In this work, we show that these challenges can be overcome with a new compression strategy. We propose the "graph model" of compression, a new theoretical framework for representing compression as a directed acyclic graph of modular codecs. OpenZL compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder. OpenZL's design enables rapid development of tailored compressors with minimal code; its universal decoder eliminates deployment lag; and its investment in a well-vetted standard component library minimizes security risks. Experimental results demonstrate that OpenZL achieves superior compression ratios and speeds compared to state-of-the-art general-purpose compressors on a variety of real-world datasets. Internal deployments at Meta have also shown consistent improvements in size and/or speed, with development timelines reduced from months to days. OpenZL thus represents a significant advance in practical, scalable, and maintainable data compression for modern data-intensive applications.
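
To make the graph model concrete, here is a minimal toy sketch in Python. It is not OpenZL's API or wire format: the codec registry, the plan notation, the JSON header, and every function name are assumptions made up for illustration, and zlib stands in for entropy and LZ backends. It only illustrates the idea stated above: a compressor is a graph of invertible codecs, the frame records which codecs were applied, and a single decoder can therefore undo any configuration.

```python
import json
import zlib

# Hypothetical toy codecs. Each encoder maps one input stream to a list of
# output streams; each decoder maps that list back to the original stream.

def tokenize_enc(data):
    # Split a byte stream into its sorted alphabet and per-byte indices.
    alphabet = bytes(sorted(set(data)))
    index = {b: i for i, b in enumerate(alphabet)}
    return [alphabet, bytes(index[b] for b in data)]

def tokenize_dec(streams):
    alphabet, indices = streams
    return bytes(alphabet[i] for i in indices)

def delta_enc(data):
    # Replace each byte with its difference from the previous byte (mod 256).
    prev, out = 0, bytearray()
    for b in data:
        out.append((b - prev) % 256)
        prev = b
    return [bytes(out)]

def delta_dec(streams):
    acc, out = 0, bytearray()
    for d in streams[0]:
        acc = (acc + d) % 256
        out.append(acc)
    return bytes(out)

CODECS = {
    "tokenize": (tokenize_enc, tokenize_dec),
    "delta": (delta_enc, delta_dec),
    "zlib": (lambda d: [zlib.compress(d)], lambda s: zlib.decompress(s[0])),
}

def compress(plan, data):
    """plan is "store" (leaf) or (codec_name, [one child plan per output])."""
    payloads = []

    def walk(node, stream):
        if node == "store":
            payloads.append(stream)
            return len(stream)  # a leaf is described by its payload length
        name, children = node
        outputs = CODECS[name][0](stream)
        assert len(children) == len(outputs), "plan arity must match codec"
        return [name, [walk(c, o) for c, o in zip(children, outputs)]]

    # Self-describing frame: 4-byte header length, header, leaf payloads.
    header = json.dumps(walk(plan, data)).encode()
    return len(header).to_bytes(4, "big") + header + b"".join(payloads)

def decompress(frame):
    """Universal decoder for this toy format: needs only the frame."""
    hlen = int.from_bytes(frame[:4], "big")
    tree = json.loads(frame[4:4 + hlen])
    pos = 4 + hlen

    def walk(node):
        nonlocal pos
        if isinstance(node, int):  # leaf: read the stored payload
            chunk = frame[pos:pos + node]
            pos += node
            return chunk
        name, children = node
        return CODECS[name][1]([walk(c) for c in children])

    return walk(tree)

# Two different compressor configurations, one decoder.
PLAN_A = ("tokenize", ["store", ("zlib", ["store"])])
PLAN_B = ("delta", [("zlib", ["store"])])
```

In this sketch, `decompress(compress(PLAN_A, data))` and `decompress(compress(PLAN_B, data))` both return `data` without the decoder being told which plan was used. Changing the plan changes only the header of the frame, which is the property that lets a tailored compressor be deployed without shipping a new decoder, provided every codec it uses is already in the registry.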

Paper Structure

This paper contains 58 sections, 2 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: An example invocation of the tokenize codec.
  • Figure 2: An example compressor that uses tokenize, Huffman, and LZ77.
  • Figure 3: An example of function graph expansion. Function graphs are shaded and their expansions marked in dotted lines.
  • Figure 4: Common abstract compressor structure.
  • Figure 5: Abstract compressor training workflow.
  • ...and 3 more figures

Theorems & Definitions (6)

  • Definition 3.1: Message Sets
  • Definition 3.2: Codec
  • Remark
  • Definition 3.3: Compression Graph
  • Definition 3.4: Function Graph
  • Definition 3.5: Resolved Graph