Binary Jumbled Indexing: Suffix tree histogram

Luís Cunha, Mário Medina

TL;DR

This work studies Binary Jumbled Indexing (BJI) for binary strings, where a Parikh vector $(a,b)$ indicates a substring with $a$ zeros and $b$ ones. The problem admits $O(1)$ queries using a compact index built on interval properties and prefix normal forms; prior work by Cunha et al. builds that index in $O(n+\rho^2)$ time. The authors introduce SFTree, a suffix-tree-based indexing approach that eliminates duplicate substrings and leverages vectorization to reduce memory latency, achieving practical speedups while retaining similar asymptotic bounds. They prove that the average number of runs in a binary string is $\rho = n/4$, yielding an average-case complexity of $\Theta(n^2)$ for run-based indexing, and establish a lower bound of $\Omega(n+\rho^2)$ for such approaches. Empirical results on random and structured inputs show that SFTree substantially outperforms the baseline, with potential extensions to other suffix-tree applications.

Abstract

Given a binary string $\omega$ over the alphabet $\{0, 1\}$, a vector $(a, b)$ is a Parikh vector if and only if a factor of $\omega$ contains exactly $a$ occurrences of $0$ and $b$ occurrences of $1$. Answering whether a vector is a Parikh vector of $\omega$ is known as the Binary Jumbled Indexing Problem (BJPMP) or the Histogram Indexing Problem. Most solutions to this problem rely on an $O(n)$ word-space index to answer queries in constant time, encoding the Parikh set of $\omega$, i.e., all its Parikh vectors. Cunha et al. (Combinatorial Pattern Matching, 2017) introduced an algorithm (JBM2017) that computes the index table in $O(n+\rho^2)$ time, where $\rho$ is the number of runs of identical digits in $\omega$, leading to $O(n^2)$ in the worst case. We prove that the average number of runs $\rho$ is $n/4$, confirming that the quadratic behavior also holds in the average case. We propose a new algorithm, SFTree, which uses a suffix tree to remove duplicate substrings. Although SFTree also has an average-case complexity of $\Theta(n^2)$ due to the fundamental reliance on run boundaries, it achieves practical improvements by minimizing memory access overhead through vectorization. The suffix tree further allows distinct substrings to be processed efficiently, reducing the effective cost of memory access. As a result, while both algorithms exhibit similar theoretical growth, SFTree significantly outperforms others in practice. Our analysis highlights both the theoretical and practical benefits of the SFTree approach, with potential extensions to other applications of suffix trees.
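
To make the $O(n)$ word-space index concrete, the following is a minimal sketch of the standard interval-property index, not of JBM2017 or SFTree. It relies on the fact that sliding a window of fixed length over a binary string changes its number of ones by at most one per step, so storing the minimum and maximum number of ones per window length is enough to answer queries in constant time. The construction shown is the naive $O(n^2)$ one; the names `build_index` and `is_parikh_vector` are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def build_index(w: str):
    """Naive O(n^2) construction of the interval index (illustrative only).

    Returns (min_ones, max_ones), where entry m holds the fewest and the
    most 1s found in any factor of length m, for m = 0..len(w).
    """
    n = len(w)
    # prefix[i] = number of 1s in w[:i]
    prefix = [0] + list(accumulate(int(c) for c in w))
    min_ones = [0] * (n + 1)
    max_ones = [0] * (n + 1)
    for m in range(1, n + 1):
        counts = [prefix[i + m] - prefix[i] for i in range(n - m + 1)]
        min_ones[m] = min(counts)
        max_ones[m] = max(counts)
    return min_ones, max_ones


def is_parikh_vector(index, a: int, b: int) -> bool:
    """O(1) query: does some factor contain exactly a zeros and b ones?

    Convention assumed here: (0, 0) is accepted, matching the empty factor.
    """
    min_ones, max_ones = index
    m = a + b
    if a < 0 or b < 0 or m >= len(min_ones):
        return False
    # Interval property: every count between the minimum and maximum occurs.
    return min_ones[m] <= b <= max_ones[m]


index = build_index("0110100")
print(is_parikh_vector(index, 1, 3))  # factor "1101" has 1 zero and 3 ones
print(is_parikh_vector(index, 0, 3))  # no factor consists of three 1s
```

The two arrays take $n+1$ words each, which is where the linear space bound comes from; the algorithms discussed in the abstract differ in how quickly they fill those arrays, not in how queries are answered.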

Paper Structure (23 sections, 3 theorems, 3 equations, 3 figures, 3 tables, 1 algorithm)

Key Result

theorem 1

The average number of runs $\rho$ in a binary string of length $n$ is $\frac{n}{4}$.
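
The statement can be checked by brute force for small $n$. The sketch below is only a sanity check and is not taken from the paper. It assumes that $\rho$ counts maximal blocks of consecutive 1s and that all $2^n$ binary strings are equally likely; both are assumptions, since this summary does not spell out the counting convention or the distribution. Under those assumptions the exact mean is $(n+1)/4$, which agrees with $n/4$ up to a lower-order term. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def count_runs_of_ones(w: str) -> int:
    """Number of maximal blocks of consecutive 1s in w (assumed meaning of rho)."""
    return sum(
        1
        for i, c in enumerate(w)
        if c == "1" and (i == 0 or w[i - 1] == "0")
    )


def mean_runs_of_ones(n: int) -> Fraction:
    """Exact mean of count_runs_of_ones over all 2^n binary strings of length n."""
    total = sum(
        count_runs_of_ones("".join(bits)) for bits in product("01", repeat=n)
    )
    return Fraction(total, 2 ** n)


for n in (4, 8, 12):
    # Prints the exact mean next to n/4 for comparison.
    print(n, mean_runs_of_ones(n), Fraction(n, 4))
```

Exhaustive enumeration is exponential in $n$, so this is only practical for short strings; for larger $n$ a random sample of strings would give an estimate instead.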

Figures (3)

  • Figure 1: Built with https://brenden.github.io/ukkonen-animation/. The node labels correspond to the steps of Ukkonen's algorithm.
  • Figure 2: Execution time comparison between JBM2017 and SFTree algorithms. The y-axis represents execution time in seconds, while the x-axis represents the input size in thousands of digits. The graph illustrates the simulated asymptotic growth curves for both algorithms.
  • Figure 3: Asymptotic comparison between JBM2017 and SFTree for input sizes in range(50, 25000, 50).

Theorems & Definitions (6)

  • theorem 1
  • proof
  • lemma
  • proof
  • theorem 2
  • proof