The Trie Measure, Revisited

Jarno N. Alanko, Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Bojana Kodric, Nicola Prezza

TL;DR

This work studies how to minimize the trie measure when encoding sequences of integer sets, restricting attention to structured families of prefix-free encodings. For shifted encodings, it gives two exact algorithms that find the optimal shift: (i) an $O(u+N\log u)$-time algorithm based on a simple array representation, and (ii) an $O(N\log^2 u)$-time algorithm using a DAG-compressed segment tree; both exploit a periodic structure in the shift. For order-preserving (ordered) prefix-free encodings, a Knuth-like dynamic programming approach yields an optimal ordered encoding in $O(N+u^3)$ time. The ordered encoding is further extended to shifted-ordered encodings by doubling the domain and applying the same DP framework, preserving the cubic-time bound. The paper provides practical implementations and experimental evidence on multiple real-world datasets, showing that shifted-ordered encodings can markedly outperform individual shifted or ordered encodings and that typical shifts are near-optimal in practice. These results inform space-efficient encoding choices for data-structure applications that rely on tries and set sequences, including subset wavelet trees and offline set-intersection strategies.

Abstract

In this paper, we study the following problem: given $n$ subsets $S_1, \dots, S_n$ of an integer universe $U = \{0,\dots, u-1\}$, having total cardinality $N = \sum_{i=1}^n |S_i|$, find a prefix-free encoding $enc : U \rightarrow \{0,1\}^+$ minimizing the so-called trie measure, i.e., the total number of edges in the $n$ binary tries $\mathcal T_1, \dots, \mathcal T_n$, where $\mathcal T_i$ is the trie packing the encoded integers $\{enc(x):x\in S_i\}$. We first observe that this problem is equivalent to that of merging $u$ sets with the cheapest sequence of binary unions, a problem which in [Ghosh et al., ICDCS 2015] is shown to be NP-hard. Motivated by the hardness of the general problem, we focus on particular families of prefix-free encodings. We start by studying the fixed-length shifted encoding of [Gupta et al., Theoretical Computer Science 2007]. Given a parameter $0\le a < u$, this encoding sends each $x \in U$ to $(x + a) \bmod u$, interpreted as a bit-string of $\log u$ bits. We develop the first efficient algorithms that find the value of $a$ minimizing the trie measure when this encoding is used. Our two algorithms run in $O(u + N\log u)$ and $O(N\log^2 u)$ time, respectively. We proceed by studying ordered encodings (a.k.a. monotone or alphabetic), and describe an algorithm finding the optimal such encoding in $O(N+u^3)$ time. Within the same running time, we show how to compute the best shifted ordered encoding, provably no worse than either the optimal shifted or the optimal ordered encoding. We provide implementations of our algorithms and discuss how these encodings perform in practice.
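To make the shifted encoding concrete, here is a minimal Python sketch that counts trie edges under a given shift and finds the best shift by trying all of them. It is a brute-force baseline, not one of the paper's $O(u + N\log u)$ or $O(N\log^2 u)$ algorithms. The function names are illustrative, and $u$ is assumed to be a power of two with $u \ge 2$.

```python
def trie_edges(codes):
    """Number of edges in the binary trie storing the given bit-strings.

    Every edge corresponds to one distinct non-empty prefix of some code.
    """
    prefixes = set()
    for code in codes:
        for length in range(1, len(code) + 1):
            prefixes.add(code[:length])
    return len(prefixes)


def shifted_code(x, a, u):
    """Code of x under shift a: (x + a) mod u written on log2(u) bits.

    Assumes u is a power of two, so (u - 1).bit_length() == log2(u).
    """
    width = (u - 1).bit_length()
    return format((x + a) % u, f"0{width}b")


def shifted_trie_measure(sets, a, u):
    """Trie measure of a sequence of sets under the shifted encoding."""
    return sum(trie_edges(shifted_code(x, a, u) for x in s) for s in sets)


def best_shift_bruteforce(sets, u):
    """Try every shift a in [0, u) and return (a, measure) for the first minimum.

    Brute-force baseline only: roughly O(u * N * log u) work.
    """
    candidates = ((a, shifted_trie_measure(sets, a, u)) for a in range(u))
    return min(candidates, key=lambda pair: pair[1])


example = [{3, 4, 6}]
print(shifted_trie_measure(example, 0, 8))  # unshifted encoding
print(shifted_trie_measure(example, 1, 8))  # shift a = 1
print(best_shift_bruteforce(example, 8))
```

On the set $\{3,4,6\}$ with $u=8$, the first two calls reproduce the edge counts reported in the captions of Figures 1 and 2 (8 edges without a shift, 6 edges with $a=1$).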

Paper Structure

This paper contains 16 sections, 7 theorems, 18 equations, 7 figures, 2 tables, 5 algorithms.

Key Result

Lemma 3

Let $S=\{x_1,\cdots,x_m\}$ be a set of $m$ integers with $0\le x_1 <\cdots<x_m<u$. Let us define $x_{m+1}:=x_1+u$. For every $a\in U$, it holds that

Figures (7)

  • Figure 1: Example of a trie encoding the set of integers $\{3,4,6\} \subseteq \{0,1,\dots, 7\}$ over a universe of size $u=8$. Black edges belong to the trie. Gray edges do not belong to the trie and are shown only for completeness. Each integer is encoded using $\log 8 = 3$ bits (logarithms are in base 2). The trie has 8 edges, so the trie measure for this set using the standard integer encoding is 8.
  • Figure 2: A trie that stores the same set of integers $\{3,4,6\} \subseteq \{0,1,\dots, 7\}$ of Figure 1, but with the shifted integer encoding mapping each $x\in U$ to (the binary string of $\log u$ bits) $(x+1) \bmod 8$. The trie has 6 edges, so the shifted trie measure with shift $a=1$ is 6.
  • Figure 3: Left. An ordered prefix-free encoding $\mathrm{enc}$ of the universe $U = \{0,1,2,3\}$, represented as a binary trie $T^{\mathrm{enc}}$. This encoding is used (right part) to encode sets $S_1 = \{1,2\}, S_2 = \{0,1\}, S_3 = \{1,2,3\}$. The corresponding sets $A_x$ are: $A_0 = \{2\}, A_1 = \{1,2,3\}, A_2 = \{1,3\}, A_3 = \{3\}$. Highlighted in red, an edge leading to a node $v$ with $\cup_{x\in T^{\mathrm{enc}}_v} A_x = A_2 \cup A_3 = \{1,3\}$, meaning that the tries for (the encoded) $S_1$ and $S_3$ will contain a copy of the same edge. Right. Using the prefix-free encoding $\mathrm{enc}$ to encode $S_1,S_2,S_3$ by packing their codes into three tries (gray edges do not belong to the tries and are shown only for completeness). The tries for $S_1,S_2,S_3$ contain in total 12 edges, so $\mathrm{trie}(\mathrm{enc}(\langle S_1, S_2, S_3\rangle)) = 12$. In red: the two copies of the red edge on the left part of the figure, highlighting the equivalence of the two formulations of our trie-encoding problem. As a matter of fact, this is an optimal ordered code.
  • Figure 4: The nodes in the trie of $S=\{x_1, x_2, x_3, x_4\} = \{2,4,10,13\}$ over a universe of size $u = 16$. The $i$-th box on each row spans range $[x_{i}, x_{i+1})$, with $x_5 := x_1 + u$. The number of edges in the trie of the set is equal to the number of shaded boxes that contain at least one 1-bit. The shaded boxes correspond to the edges with the same color.
  • Figure 5: Representation of $\mathrm{trie}(S+a)$ based on $c^{(k)}_j$ for $S=\{2,4,10,13\}$ at level $k=3$ and $u=16$. Each row represents $c^{(3)}_{j+a}$, for $a\in U$. Boxes indicate ranges $[x_i,x_{i+1})$ and red boxes contain at least one 1. As an example, consider the pair $x_1, x_2$ (leftmost boxes). Among the shifts $0,\ldots,4$, the only shifts for which we do not pay the cost of $k=3$ for this pair are $a=2$ and $a=3$; i.e., $x_1+a$ and $x_2+a$ share an edge at level $3$ for $a=2,3$ in $\mathrm{trie}(S + a)$. Hence, in the figure we have two non-red boxes in the rows corresponding to $a=2,3$.
  • ...and 2 more figures
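
The two formulations mentioned in the caption of Figure 3 can be checked on its example with a short Python sketch. The first function counts trie edges directly for a given code. The second minimizes, over ordered code trees, the sum of $|\cup_{x \in T_v} A_x|$ over all non-root nodes $v$, using a plain cubic interval dynamic program. This is only an illustrative sketch, not the paper's Knuth-like $O(N+u^3)$ algorithm. The function names and the balanced 2-bit code are assumptions, since the caption does not spell out the code, and $u \ge 2$ is assumed.

```python
def trie_measure(sets, enc):
    """Total number of trie edges when each set is encoded with enc (dict: int -> bit-string)."""
    total = 0
    for s in sets:
        prefixes = set()
        for x in s:
            code = enc[x]
            for length in range(1, len(code) + 1):
                prefixes.add(code[:length])
        total += len(prefixes)  # one edge per distinct non-empty prefix
    return total


def optimal_ordered_cost(sets, u):
    """Minimum trie measure over ordered prefix-free encodings of {0, ..., u-1}.

    Illustrative interval DP over full binary code trees whose leaves
    appear in increasing order; assumes u >= 2.
    """
    # A[x] = bitmask of the indices i such that x belongs to sets[i].
    A = [0] * u
    for i, s in enumerate(sets):
        for x in s:
            A[x] |= 1 << i

    # w[i][j] = |A_i ∪ ... ∪ A_j|: number of tries containing the edge
    # that enters a node whose subtree holds exactly the codes of i..j.
    w = [[0] * u for _ in range(u)]
    for i in range(u):
        mask = 0
        for j in range(i, u):
            mask |= A[j]
            w[i][j] = bin(mask).count("1")

    # opt[i][j] = cheapest subtree over leaves i..j, including the edge entering it.
    opt = [[0] * u for _ in range(u)]
    for i in range(u):
        opt[i][i] = w[i][i]
    for length in range(2, u + 1):
        for i in range(u - length + 1):
            j = i + length - 1
            best_split = min(opt[i][k] + opt[k + 1][j] for k in range(i, j))
            opt[i][j] = w[i][j] + best_split

    # The root has no incoming edge, so its own term is not paid.
    return opt[0][u - 1] - w[0][u - 1]


sets = [{1, 2}, {0, 1}, {1, 2, 3}]                  # S_1, S_2, S_3 from Figure 3
balanced = {0: "00", 1: "01", 2: "10", 3: "11"}     # assumed code, for illustration
print(trie_measure(sets, balanced))
print(optimal_ordered_cost(sets, 4))
```

Both calls return 12 on this instance, in line with the caption's statement that the depicted encoding yields 12 edges and is an optimal ordered code.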

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Definition 3
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Theorem 7
  • Theorem 8
  • Lemma 8