Corra: Correlation-Aware Column Compression

Hanwen Liu, Mihail Stoian, Alexander van Renen, Andreas Kipf

TL;DR

Single-column encodings have plateaued in compression; Corra addresses this by introducing horizontal, correlation-aware encoding schemes that diff-encode a column with respect to reference columns. It presents non-hierarchical and hierarchical schemes, plus support for multiple reference columns and outlier handling, with a cost-based method to choose references. Empirical results on TPC-H lineitem, LDBC Message, DMV, and Taxi show substantial compression gains over single-column baselines and competitive performance with C3, alongside expected query-latency trade-offs due to cross-column access. The work promises significant storage and memory savings for data lakes and cloud databases, with future directions toward broader correlation types and automated correlation detection.

Abstract

Column encoding schemes have witnessed a spark of interest with the rise of open storage formats (like Parquet) in data lakes in modern cloud deployments. This is not surprising -- as data volume increases, it becomes more and more important to reduce storage cost on block storage (such as S3) as well as reduce memory pressure in multi-tenant in-memory buffers of cloud databases. However, single-column encoding schemes have reached a plateau in terms of the compressed size they can achieve. We argue that this is due to the neglect of cross-column correlations. For instance, consider the column pair ($\texttt{city}$, $\texttt{zip\_code}$). Typically, a city has only a few dozen unique zip codes. If this information is properly exploited, it can significantly reduce the space consumption of the latter column. In this work, we depart from the established path of compressing data using only single-column encoding schemes and introduce several correlation-aware encoding schemes, which we call $\textit{horizontal}$. We demonstrate their advantages over single-column encoding schemes on TPC-H's $\texttt{lineitem}$, LDBC's $\texttt{message}$, DMV, and Taxi datasets. Our correlation-aware encoding schemes save up to 58.3% of the compressed size over single-column schemes for $\texttt{lineitem}$'s $\texttt{receiptdate}$, 53.7% for DMV's $\texttt{zip\_code}$, and 85.16% for Taxi's $\texttt{total\_amount}$.
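
To make the idea of diff-encoding a column with respect to a reference column concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the date values are made up (days since an arbitrary epoch), and bit-packing is modeled only by counting the bits each value would need.

```python
def bits_needed(lo: int, hi: int) -> int:
    """Bits per value when integers in [lo, hi] are stored as offsets from lo."""
    return (hi - lo).bit_length()


def diff_encode(target, reference):
    """Store target[i] - reference[i], shifted so the smallest difference is 0."""
    diffs = [t - r for t, r in zip(target, reference)]
    base = min(diffs)
    return base, [d - base for d in diffs]


def diff_decode(base, packed, reference):
    """Rebuild the target column; this needs access to the reference column."""
    return [r + base + p for r, p in zip(reference, packed)]


# Made-up sample data: receiptdate stays within a few days of shipdate.
shipdate = [9000, 9400, 9950, 10100, 10800]
receiptdate = [9003, 9425, 9951, 10130, 10812]

base, packed = diff_encode(receiptdate, shipdate)

# Vertical (single-column) baseline: offsets from the column's own minimum.
vertical_bits = bits_needed(min(receiptdate), max(receiptdate))
# Horizontal: offsets from the reference column.
horizontal_bits = bits_needed(0, max(packed))

print(vertical_bits, horizontal_bits)
```

On this toy input the horizontal differences span a much narrower range than the column's own values, so each value needs fewer bits. The decode step also shows where the query-latency trade-off mentioned in the TL;DR comes from: reading the diff-encoded column requires reading its reference column as well.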

Paper Structure (7 sections, 8 figures, 3 tables, 1 algorithm)

Figures (8)

  • Figure 1: Vertical (prior work) vs. horizontal encodings (ours): Exploiting the correlation between date columns in TPC-H's lineitem table for the column pair $(\texttt{shipdate}, \texttt{commitdate})$. Instead of encoding commitdate w.r.t. its own values (vertically), it is better to encode it w.r.t. shipdate (horizontally). The dashed arrows show the corresponding dependency.
  • Figure 2: Detecting the optimal diff-encoding configuration in TPC-H (SF 10) for its three date-valued columns. The weight of an $a \rightarrow b$ edge is the size of column $a$ when diff-encoded w.r.t. reference column $b$.
  • Figure 3: Hierarchical encoding: Exploiting the correlation of the column pair $(\texttt{city}, \texttt{zip-code})$ in the DMV dataset. The metadata contains an array of zip-codes along with an array of offsets, one per city, marking where that city's zip-codes start. The city dictionary is used to reconstruct the city column.
  • Figure 4: Non-hierarchical compression with multiple reference columns: Encoding the original target column with outliers (in this case, $\{O_1, O_2\}$). The regular values, $\{V_1, V_2, V_3\}$, are encoded as described in Tab. \ref{tab:logical-inferencing}.
  • Figure 5: Query latency for selectivities in $\{0.001, 0.002, \ldots, 0.9, 1.0\}$ with materialization of the query output. We run non-hierarchical encoding (§\ref{subsec:non_hierarchical}) on TPC-H's lineitem (SF 10) for l_shipdate (reference) and l_receiptdate (diff-encoded), and hierarchical encoding (§\ref{subsec:hierarchical}) on LDBC's message (SF 30) for countryid (reference) and ip (diff-encoded).
  • ...and 3 more figures
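
The hierarchical layout described in the Figure 3 caption can be sketched as follows. This is one reading of the caption, not the authors' code: the function names, the first-seen ordering of zip codes within a city, and the sample values are all assumptions made for illustration. Cities are assumed to be already dictionary-encoded as small integers.

```python
def hierarchical_encode(city_codes, zips):
    """Encode zips relative to their city.

    Returns (zip_array, offsets, local_index), where zip_array concatenates the
    distinct zip codes of each city, offsets[c] is where city c's slice starts,
    and local_index[i] is row i's position inside its city's slice.
    """
    num_cities = max(city_codes) + 1
    per_city = [[] for _ in range(num_cities)]    # distinct zips per city
    position = [{} for _ in range(num_cities)]    # zip -> index within its city
    local_index = []
    for c, z in zip(city_codes, zips):
        if z not in position[c]:
            position[c][z] = len(per_city[c])
            per_city[c].append(z)
        local_index.append(position[c][z])

    zip_array, offsets = [], []
    for group in per_city:
        offsets.append(len(zip_array))
        zip_array.extend(group)
    return zip_array, offsets, local_index


def hierarchical_decode(city_codes, zip_array, offsets, local_index):
    """Look up each row's zip code through its city's offset."""
    return [zip_array[offsets[c] + i] for c, i in zip(city_codes, local_index)]


# Made-up sample: three cities, four distinct zip codes overall.
city_codes = [0, 1, 0, 1, 0, 2]
zips = [10001, 60601, 10002, 60601, 10001, 94103]

zip_array, offsets, local_index = hierarchical_encode(city_codes, zips)

# Bits per row: per-city local index vs. a global dictionary over all zips.
local_bits = max(local_index).bit_length()
global_bits = (len(set(zips)) - 1).bit_length()
print(local_bits, global_bits)
```

Because a row's index only has to distinguish the zip codes within its own city, it can be narrower than an index into a dictionary of all zip codes; the cost is that decoding a zip code requires the row's city code.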