Binary Jumbled Indexing: Suffix tree histogram

Luís Cunha; Mário Medina

TL;DR

This work studies Binary Jumbled Indexing (BJI) for binary strings, where a Parikh vector $(a,b)$ indicates a substring with $a$ zeros and $b$ ones. The problem admits $O(1)$ queries using a compact index built on interval properties and prefix normal forms; prior work by Cunha et al. builds this index in $O(n+ρ^2)$ time, where $ρ$ is the number of runs. The authors introduce SFTree, a suffix-tree-based indexing approach that eliminates duplicate substrings and uses vectorization to reduce memory latency, achieving practical speedups while retaining similar asymptotic bounds. They prove that the average number of runs in a binary string is $ρ = n/4$, yielding an average-case complexity of $Θ(n^2)$ for run-based indexing, and establish a lower bound of $Ω(n+ρ^2)$ for such approaches. Empirical results on random and structured inputs show that SFTree substantially outperforms the baseline, and the approach may extend to other suffix-tree applications.

Abstract

Given a binary string $ω$ over the alphabet $\{0, 1\}$, a vector $(a, b)$ is a Parikh vector of $ω$ if and only if some factor of $ω$ contains exactly $a$ occurrences of $0$ and $b$ occurrences of $1$. Deciding whether a vector is a Parikh vector of $ω$ is known as the Binary Jumbled Indexing Problem (BJPMP) or the Histogram Indexing Problem. Most solutions to this problem rely on an $O(n)$ word-space index that encodes the Parikh set of $ω$, i.e., all its Parikh vectors, and answers queries in constant time. Cunha et al. (Combinatorial Pattern Matching, 2017) introduced an algorithm (JBM2017) that computes the index table in $O(n+ρ^2)$ time, where $ρ$ is the number of runs of identical digits in $ω$; this is $O(n^2)$ in the worst case. We prove that the average number of runs $ρ$ is $n/4$, confirming that the behavior is also quadratic in the average case. We propose a new algorithm, SFTree, which uses a suffix tree to remove duplicate substrings. Although SFTree also has an average-case complexity of $Θ(n^2)$ because it fundamentally relies on run boundaries, it achieves practical improvements by minimizing memory access overhead through vectorization. The suffix tree further allows distinct substrings to be processed efficiently, reducing the effective cost of memory access. As a result, while both algorithms exhibit similar theoretical growth, SFTree significantly outperforms JBM2017 in practice. Our analysis highlights both the theoretical and practical benefits of the SFTree approach, with potential extensions to other applications of suffix trees.
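To make the $O(n)$-word index and the constant-time query concrete, here is a minimal Python sketch. It relies on the interval property of binary strings: for a fixed factor length $k$, the attainable numbers of ones form a contiguous range, so storing the minimum and maximum per length is enough. The table is built by brute force in $O(n^2)$ time; this is neither JBM2017 nor SFTree, and the names `build_index` and `is_parikh` are illustrative, not taken from the paper. The sketch only answers queries for non-empty factors.

```python
def build_index(w: str):
    """Return (min_ones, max_ones); entry k covers all factors of length k."""
    n = len(w)
    # prefix[i] = number of '1' digits in w[:i]
    prefix = [0]
    for ch in w:
        prefix.append(prefix[-1] + (ch == "1"))
    min_ones = [0] * (n + 1)
    max_ones = [0] * (n + 1)
    for k in range(1, n + 1):
        # Brute force: count the ones in every window of length k.
        counts = [prefix[i + k] - prefix[i] for i in range(n - k + 1)]
        min_ones[k] = min(counts)
        max_ones[k] = max(counts)
    return min_ones, max_ones


def is_parikh(index, a: int, b: int) -> bool:
    """Constant-time query: does some factor have a zeros and b ones?"""
    min_ones, max_ones = index
    k = a + b
    if k < 1 or k >= len(min_ones):
        return False  # empty or longer than the string
    return min_ones[k] <= b <= max_ones[k]


index = build_index("0110")
print(is_parikh(index, 1, 1))  # factor "01" exists
print(is_parikh(index, 2, 0))  # "0110" has no factor "00"
```

The two arrays hold $2(n+1)$ integers in total, which matches the linear-space index described above; the algorithms discussed in the paper differ in how quickly they fill the table, not in how queries are answered.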

Paper Structure (23 sections, 3 theorems, 3 equations, 3 figures, 3 tables, 1 algorithm)

Introduction
Contributions.
Organization.
Preliminaries
Parikh set and Parikh vectors.
Binary Jumbled Indexing
Prefix normal forms and Prefix normal words.
Cunha et al.'s algorithm
Suffix tree and special pattern encode
Special pattern encode.
Building the suffix tree.
Binary Jumbled Indexing: Algorithm
Time complexity analysis.
Proving $\rho^2$ as lower bound for indexing table.
Practical results and discussions
...and 8 more sections

Key Result

Theorem 1

The average number of runs $\rho$ in a binary string of length $n$ is $\frac{n}{4}$.
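The quantity $\rho$ above is a run count, and it also drives the $O(n+\rho^2)$ bound. Below is a minimal Python sketch of a run-length encoding, assuming a run is a maximal block of identical digits as the abstract describes; the theorem's exact counting convention is not shown in this summary, and the function names are illustrative rather than taken from the paper.

```python
from itertools import groupby


def run_lengths(w: str):
    """Return the run-length encoding of w as (digit, length) pairs."""
    return [(digit, len(list(block))) for digit, block in groupby(w)]


def count_runs(w: str) -> int:
    """Number of maximal blocks of identical digits in w."""
    return len(run_lengths(w))


print(run_lengths("0011101"))  # [('0', 2), ('1', 3), ('0', 1), ('1', 1)]
print(count_runs("0011101"))   # 4
```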

Figures (3)

Figure 1: Built with https://brenden.github.io/ukkonen-animation/. The node labels correspond to the steps of Ukkonen's algorithm.
Figure 2: Execution time comparison between JBM2017 and SFTree algorithms. The y-axis represents execution time in seconds, while the x-axis represents the input size in thousands of digits. The graph illustrates the simulated asymptotic growth curves for both algorithms.
Figure 3: Asymptotic comparison between JBM2017 and SFTree in range(50, 25000, 50)

Theorems & Definitions (6)

Theorem 1
Proof
Lemma
Proof
Theorem 2
Proof
