EntroGD: Efficient Compression and Accurate Direct Analytics on Compressed Data

Xiaobo Zhao, Daniel E. Lucani

TL;DR

EntroGD tackles the scalability bottleneck of generalized deduplication for high-dimensional data by decoupling analytics from compression: it first generates condensed samples to preserve analytics and then applies entropy-guided bit selection to maximize compression. The method reduces base-bit selection complexity from $O(nd^2)$ to $O(nd)$ while maintaining analytic fidelity, achieving competitive compression and superior analytics on compressed data across 18 datasets. Empirical results show clustering on EntroGD representations closely matches clustering on the original data (AR ≈ 1.001; AMI ≈ 0.961–0.968; Silhouette ≈ 0.393–0.394) with only a small analytics data footprint, and offers up to 53.5× faster configuration than GreedyGD. Together, EntroGD enables efficient, scalable analytics directly on compressed data for large-scale IoT and edge-cloud workloads.

Abstract

Generalized Deduplication (GD) enables lossless compression with direct analytics on compressed data by dividing data into \emph{bases} and \emph{deviations} and performing dictionary encoding on the former. However, GD algorithms face scalability challenges for high-dimensional data. For example, the GreedyGD algorithm relies on an iterative bit-selection process across $d$-dimensional data, resulting in $O(nd^2)$ complexity for $n$ data rows when selecting the bits to be used as bases and deviations. Although the number of data rows $n$ can be reduced during training at the expense of performance, high-dimensional data still suffers a marked loss in performance. This paper introduces EntroGD, an entropy-guided GD framework that reduces the complexity of the bit-selection algorithm to $O(nd)$. EntroGD operates in two steps. First, it generates condensed samples to preserve analytic fidelity. Second, it applies entropy-guided bit selection to maximize compression efficiency. Across 18 datasets of varying types and dimensionalities, EntroGD achieves compression performance comparable to GD-based and universal compressors, while reducing configuration time by up to 53.5$\times$ over GreedyGD and accelerating clustering by up to 31.6$\times$ over the original data with negligible accuracy loss. The clustering speedup comes from performing analytics on the condensed samples, which are far fewer than the original samples. Thus, EntroGD provides an efficient and scalable solution for performing analytics directly on compressed data.
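
To make the base/deviation split concrete, the following is a minimal sketch of the general GD idea described in the abstract, not the authors' implementation. The function names `gd_split` and `gd_restore`, the 8-bit samples, and the fixed `base_mask` are all assumptions chosen for illustration; in GreedyGD or EntroGD the mask would come from the bit-selection step.

```python
def gd_split(rows, base_mask):
    """Split each non-negative integer sample into a base and a deviation.

    Bits set in `base_mask` (hypothetical, fixed here) form the base;
    the remaining bits form the deviation. Bases are dictionary-encoded.
    """
    dictionary = {}   # base value -> base ID, in order of first appearance
    base_ids = []     # one dictionary ID per sample
    deviations = []   # one deviation per sample, stored verbatim
    for value in rows:
        base = value & base_mask
        deviation = value & ~base_mask
        base_ids.append(dictionary.setdefault(base, len(dictionary)))
        deviations.append(deviation)
    bases = list(dictionary)  # position in this list equals the base ID
    return bases, base_ids, deviations


def gd_restore(bases, base_ids, deviations):
    """Losslessly rebuild the samples from bases, IDs, and deviations."""
    return [bases[i] | dev for i, dev in zip(base_ids, deviations)]


# Illustrative 8-bit samples; the upper four bits are treated as base bits.
samples = [0b10110001, 0b10110110, 0b10111111, 0b01000010]
bases, base_ids, deviations = gd_split(samples, base_mask=0b11110000)
restored = gd_restore(bases, base_ids, deviations)
```

In this toy input, three of the four samples share one base, so only two distinct bases need to be stored. Because the deviations are kept verbatim, the round trip through `gd_restore` is lossless.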

Paper Structure

This paper contains 19 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: An example of Generalized Deduplication.
  • Figure 2: Comparison of base-bit selection in GreedyGD and EntroGD on a dataset with $n$ samples and $d = 4$ dimensions of 8-bit data. GreedyGD iteratively selects the bit $\text{bit}^*$ that minimizes the cost function as the next base bit, whereas EntroGD selects base bits in ascending order of entropy. By eliminating the iterative search, EntroGD reduces the complexity from $O(nd^2)$ to $O(nd)$.
  • Figure 3: Box plot of CR across all datasets. EntroGD achieves the second-lowest median CR after Bzip2.
  • Figure 4: Detailed performance comparison between EntroGD and GreedyGD across all datasets.
  • Figure 5: Detailed performance comparison between GreedyGD+ and GreedyGD across all datasets.
  • ...and 2 more figures
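
The entropy ordering mentioned in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: the names `bit_entropies` and `rank_bits_by_entropy`, the empirical binary-entropy estimate, and the tie-break toward more significant bits are hypothetical, and the rule for how many of the ranked bits become base bits is not given in this excerpt, so it is left out.

```python
import math


def bit_entropies(rows, width):
    """Empirical binary entropy of each bit position (index 0 = LSB).

    A single pass over the data counts the ones per position, so the
    work is proportional to len(rows) * width.
    """
    n = len(rows)
    ones = [0] * width
    for value in rows:
        for b in range(width):
            ones[b] += (value >> b) & 1
    entropies = []
    for count in ones:
        p = count / n
        if p in (0.0, 1.0):
            entropies.append(0.0)  # a constant bit carries no information
        else:
            entropies.append(-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return entropies


def rank_bits_by_entropy(rows, width):
    """Bit positions in ascending order of entropy.

    Ties go to the more significant bit (an assumption of this sketch).
    """
    h = bit_entropies(rows, width)
    return sorted(range(width), key=lambda b: (h[b], -b))


rows = [0b1000, 0b1001, 0b1010, 0b1111]
entropies = bit_entropies(rows, width=4)
ranking = rank_bits_by_entropy(rows, width=4)  # -> [3, 2, 1, 0]
```

Each bit is scored once, independently of the others, so there is no repeated re-evaluation of candidate bits as in an iterative greedy search. For a fixed bit width per dimension, the total number of bit positions grows linearly with $d$, which is consistent with the $O(nd)$ cost stated in the caption.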