On the Smallest Size of Internal Collage Systems

Soichiro Migita, Kyotaro Uehata, Tomohiro I

TL;DR

The paper studies how the smallest size $\hat{c}(T)$ of internal collage systems relates to the smallest size $c(T)$ of general collage systems. It gives a constructive $O(m^2)$-time transformation that converts any collage system of size $m$ into an internal collage system of size $O(m)$, which establishes $\hat{c}(T) = \Theta(c(T))$ and yields the corollary $b(T) = O(c(T))$. It also introduces a MAX-SAT formulation that computes $\hat{c}(T)$ exactly by encoding the ICS-factorization in a framework with $O(n^4)$ variables. Together, these results link the internal and general measures and give a method for computing $\hat{c}(T)$ exactly.

Abstract

A Straight-Line Program (SLP) for a string $T$ is a context-free grammar in Chomsky normal form that derives only $T$, and it can be seen as a compressed form of $T$. Kida et al. introduced collage systems [Theor. Comput. Sci., 2003] to generalize SLPs by adding repetition rules and truncation rules. The smallest size $c(T)$ of collage systems for $T$ has attracted attention as a way to see how these generalized rules improve the compression ability of SLPs. Navarro et al. [IEEE Trans. Inf. Theory, 2021] showed that $c(T) \in O(z(T))$ and that there is a string family with $c(T) \in \Omega(b(T) \log |T|)$, where $z(T)$ is the number of phrases in the Lempel-Ziv parsing of $T$ and $b(T)$ is the smallest size of bidirectional schemes for $T$. They also introduced a subclass of collage systems, called internal collage systems, and proved that its smallest size $\hat{c}(T)$ for $T$ is at least $b(T)$. While $c(T) \le \hat{c}(T)$ is obvious, it is unknown how large $\hat{c}(T)$ is compared to $c(T)$. In this paper, we prove that $\hat{c}(T) = \Theta(c(T))$ by showing that any collage system of size $m$ can be transformed into an internal collage system of size $O(m)$ in $O(m^2)$ time. Thanks to this result, we can focus on internal collage systems to study the asymptotic behavior of $c(T)$, which helps to suppress excess use of truncation rules. As a direct application, we get $b(T) = O(c(T))$, which answers an open question posed in [Navarro et al., IEEE Trans. Inf. Theory, 2021]. We also give a MAX-SAT formulation to compute $\hat{c}(T)$ for a given $T$.
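
To make the rule types in the abstract concrete, here is a minimal Python sketch of a collage system and its expansion. It is an illustration under stated assumptions, not the paper's formalism: the class names (`Char`, `Concat`, `Repeat`, `Truncate`), the function `expand`, the 0-based half-open interval convention for truncation, and the example rules are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Char:
    """X -> a (a single character)."""
    ch: str


@dataclass(frozen=True)
class Concat:
    """X -> Y Z (the SLP-style concatenation rule)."""
    left: str
    right: str


@dataclass(frozen=True)
class Repeat:
    """X -> Y^k (repetition rule)."""
    child: str
    times: int


@dataclass(frozen=True)
class Truncate:
    """X -> Y[begin..end) (truncation rule; 0-based, half-open: an assumption)."""
    child: str
    begin: int
    end: int


Rule = Union[Char, Concat, Repeat, Truncate]


def expand(rules: Dict[str, Rule], start: str) -> str:
    """Return the string derived from `start`, memoizing each nonterminal."""
    memo: Dict[str, str] = {}

    def go(name: str) -> str:
        if name in memo:
            return memo[name]
        rule = rules[name]
        if isinstance(rule, Char):
            s = rule.ch
        elif isinstance(rule, Concat):
            s = go(rule.left) + go(rule.right)
        elif isinstance(rule, Repeat):
            s = go(rule.child) * rule.times
        else:  # Truncate: take a substring of the child's expansion
            s = go(rule.child)[rule.begin:rule.end]
        memo[name] = s
        return s

    return go(start)


# A hypothetical collage system with six rules (its size is the rule count).
rules: Dict[str, Rule] = {
    "A": Char("a"),
    "B": Char("b"),
    "X1": Concat("A", "B"),      # ab
    "X2": Repeat("X1", 3),       # ababab
    "X3": Truncate("X2", 1, 4),  # bab
    "X4": Concat("X2", "X3"),    # abababbab
}
text = expand(rules, "X4")
print(text, len(rules))
```

The sketch materializes every expansion explicitly, so it is only suitable for small examples; a repetition rule can make the derived string exponentially longer than the rule set.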

Paper Structure

This paper contains 8 sections, 6 theorems, 14 equations, 6 figures.

Key Result

Theorem 1

The problem of computing $\hat{c}(T)$ for a given string $T$ is NP-hard.

Figures (6)

  • Figure 1: An illustration of the binary parse tree (right) for the internal collage system whose six rules are shown on the left. The internal nodes are depicted by circles, and the leaves by solid boxes. The characters derived from leaves are depicted with dotted boxes. This is an internal collage system because every nonterminal appears as a node label in the binary parse tree.
  • Figure 2: An illustration of the binary parse tree (right) for the non-internal collage system whose six rules are shown on the left. It is not an internal collage system because $X_4$ does not appear as a node label in the binary parse tree.
  • Figure 3: An illustration of the binary parse tree (upper right) and the grammar tree (lower right) for the internal collage system whose 9 rules are shown on the left. Three internal nodes, with labels $X_2$, $X_3$ and $X_6$, turn into leaves in the grammar tree because they are not the leftmost nodes with their own label. Observe that Proposition 2 holds, i.e., the number of internal nodes is the size $m = 9$ of the collage system, and the number of leaves is $m - m_{\mathsf{tr}} - \sigma + 1 = 9 - 1 - 2 + 1 = 7$, where $\sigma = 2$ is the alphabet size and $m_{\mathsf{tr}} = 1$ is the number of truncation rules.
  • Figure 4: Illustration for Case 1. The upper parts show where the truncated substring $\langle Q\rangle$ exists in $\langle X\rangle$ and the lower parts show how the converted collage system represents it without using $X$. The intervals in the truncation rules are abbreviated and shown as "$[\cdot)$". The reference of a nonterminal $Q$ is changed from an unreachable nonterminal $X$ to a smaller nonterminal $Y$ or $Z$. If the new reference is unreachable, it will be processed later.
  • Figure 5: Illustration for Case 2. At most four new nonterminals including $V$ are enough to represent $\langle Q\rangle$. Note that new rules that truncate $Y$ are introduced, but $Y$ is guaranteed to be reachable via $Q$.
  • ...and 1 more figure
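
The leaf count stated in the Figure 3 caption can be checked with a few lines of Python. This is only a sketch that evaluates the formula $m - m_{\mathsf{tr}} - \sigma + 1$ exactly as the caption gives it; the function name `grammar_tree_leaves` is hypothetical, and the formula is taken from the caption rather than derived here.

```python
def grammar_tree_leaves(m: int, m_tr: int, sigma: int) -> int:
    """Leaf count of the grammar tree, using the formula from the Figure 3 caption.

    m     : size of the internal collage system (number of rules)
    m_tr  : number of truncation rules
    sigma : alphabet size
    """
    return m - m_tr - sigma + 1


# Values from the Figure 3 caption: m = 9, m_tr = 1, sigma = 2.
leaves = grammar_tree_leaves(m=9, m_tr=1, sigma=2)
print(leaves)
```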

Theorems & Definitions (7)

  • Theorem 1
  • Proposition 2
  • Theorem 3
  • Theorem 4
  • Example 5
  • Lemma 6
  • Theorem 7