DeepMapping: Learned Data Mapping for Lossless Compression and Efficient Lookup

Lixi Zhou, K. Selçuk Candan, Jia Zou

TL;DR

This work argues and shows that a novel DeepMapping abstraction, which relies on the impressive memorization capabilities of deep neural networks, can provide better storage cost, better latency, and better run-time memory footprint, all at the same time.

Abstract

Storing tabular data in a way that balances storage and query efficiency is a long-standing research question in the database community. In this work, we argue and show that a novel DeepMapping abstraction, which relies on the impressive memorization capabilities of deep neural networks, can provide better storage cost, better latency, and better run-time memory footprint, all at the same time. Such unique properties may benefit a broad class of use cases in capacity-limited devices. Our proposed DeepMapping abstraction transforms a dataset into multiple key-value mappings and constructs a multi-tasking neural network model that outputs the corresponding values for a given input key. To deal with memorization errors, DeepMapping couples the learned neural network with a lightweight auxiliary data structure capable of correcting mistakes. The auxiliary structure design further enables DeepMapping to efficiently deal with insertions, deletions, and updates even without retraining the mapping. We propose a multi-task search strategy for selecting the hybrid DeepMapping structures (including model architecture and auxiliary structure) with a desirable trade-off among memorization capacity, size, and efficiency. Extensive experiments with a real-world dataset as well as synthetic and benchmark datasets, including TPC-H and TPC-DS, demonstrate that the DeepMapping approach balances retrieval speed and compression ratio better than several cutting-edge competitors.
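The hybrid idea in the abstract (a lossy learned mapping made lossless by a small correction structure) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `HybridMapping`, `toy_model`, the `fixes` dictionary, and the `live` key set are illustrative choices, and a fixed parity rule stands in for the trained multi-task neural network. The paper's actual auxiliary structure and model may differ substantially.

```python
from typing import Callable, Dict, Optional, Set


class HybridMapping:
    """Sketch: an imperfect model plus exact side structures gives exact lookups."""

    def __init__(self, data: Dict[int, str], model: Callable[[int], str]):
        self.model = model               # assumption: stand-in for a trained network
        self.live: Set[int] = set(data)  # assumption: tracks which keys exist
        # Keep only the pairs the model gets wrong; the rest is "stored" in the model.
        self.fixes: Dict[int, str] = {
            k: v for k, v in data.items() if model(k) != v
        }

    def lookup(self, key: int) -> Optional[str]:
        if key not in self.live:
            return None                  # key was never inserted or was deleted
        if key in self.fixes:
            return self.fixes[key]       # correction overrides the model
        return self.model(key)           # model output is trusted otherwise

    def upsert(self, key: int, value: str) -> None:
        """Insert or update without retraining: only side structures change."""
        self.live.add(key)
        if self.model(key) == value:
            self.fixes.pop(key, None)    # model already right, no correction needed
        else:
            self.fixes[key] = value

    def delete(self, key: int) -> None:
        self.live.discard(key)
        self.fixes.pop(key, None)


def toy_model(key: int) -> str:
    # Hypothetical "memorized" rule; a real system would run network inference here.
    return "even" if key % 2 == 0 else "odd"


# A table the toy model predicts correctly except for two keys.
table = {k: ("even" if k % 2 == 0 else "odd") for k in range(1000)}
table[7] = "prime"
table[42] = "answer"

mapping = HybridMapping(table, toy_model)
```

In this sketch the correction table holds only the two mispredicted keys, while all 1000 lookups still return the exact stored values. Storage savings in a real system would depend on how the model size compares with the data it replaces, which this toy example does not measure.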

Paper Structure (29 sections, 1 equation, 10 figures, 5 tables, 5 algorithms)


Figures (10)

  • Figure 1: DeepMapping relies on neural networks to memorize key-value mapping in tabular data.
  • Figure 2: Overview of the proposed neural network-based data compression methods.
  • Figure 3: (a) A high-level view of a candidate model and (b) a DAG (including all nodes and edges) representing the search space of one tree node in (a). Here, the subgraph connected by the red edges illustrates a sampled network.
  • Figure 4: Trade-off between compression ratio and lookup performance in TPC-H (SF=10, B=100,000) on the small-size machine. Annotations are explained in a footnote in the paper.
  • Figure 5: Trade-off between compression ratio and lookup performance in TPC-DS (SF=10, B=100,000) on the small-size machine. Annotations are explained in a footnote in the paper.
  • ...and 5 more figures