Approximate Cartesian Tree Matching with Substitutions

Panagiotis Charalampopoulos, Jonas Ellert, Manal Mohamed

TL;DR

This work addresses approximate Cartesian tree matching under substitutions, quantified by the Hamming distance, by introducing a CT-aware periodicity toolbox. The authors design a two-branch algorithm that, depending on the pattern's structure, either marks many candidate starts or trims periodic fragments to enable fast verification, ultimately achieving O(n √m · k^{2.5}) time for k ≤ m^{1/5} and O(n k^5) time for larger k, which improves over the previous O(n m k) bound whenever k = o(m^{1/4}). The approach hinges on a novel CT-block-periodicity concept that yields strong locality guarantees and transfers key periodicity ideas from strings to Cartesian trees. The results offer a framework for outlier-robust CT-matching in time series and related applications, supported by a toolbox with potential broader use in Cartesian-tree-based pattern analysis.

Abstract

The Cartesian tree of a sequence captures the relative order of the sequence's elements. In recent years, Cartesian tree matching has attracted considerable attention, particularly due to its applications in time series analysis. Consider a text $T$ of length $n$ and a pattern $P$ of length $m$. In the exact Cartesian tree matching problem, the task is to find all length-$m$ fragments of $T$ whose Cartesian tree coincides with the Cartesian tree $CT(P)$ of the pattern. Although the exact version of the problem can be solved in linear time [Park et al., TCS 2020], it remains rather restrictive; for example, it is not robust to outliers in the pattern. To overcome this limitation, we consider the approximate setting, where the goal is to identify all fragments of $T$ that are close to some string whose Cartesian tree matches $CT(P)$. In this work, we quantify closeness via the widely used Hamming distance metric. For a given integer parameter $k>0$, we present an algorithm that computes all fragments of $T$ that are at Hamming distance at most $k$ from a string whose Cartesian tree matches $CT(P)$. Our algorithm runs in time $\mathcal O(n \sqrt{m} \cdot k^{2.5})$ for $k \leq m^{1/5}$ and in time $\mathcal O(nk^5)$ for $k \geq m^{1/5}$, thereby improving upon the state-of-the-art $\mathcal O(nmk)$-time algorithm of Kim and Han [TCS 2025] in the regime $k = o(m^{1/4})$. On the way to our solution, we develop a toolbox of independent interest. First, we introduce a new notion of periodicity in Cartesian trees. Then, we lift multiple well-known combinatorial and algorithmic results for string matching and periodicity in strings to Cartesian tree matching and periodicity in Cartesian trees.
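
To make the exact problem statement concrete, here is a minimal Python sketch that reports every length-$m$ fragment of $T$ whose Cartesian tree coincides with $CT(P)$. It is a naive $\mathcal O(nm)$ baseline, not the linear-time algorithm of Park et al., and the function names are hypothetical. It assumes a min-rooted Cartesian tree in which ties are broken in favour of the leftmost minimum; the paper's exact tie-breaking convention is not stated in this summary.

```python
def cartesian_tree_parents(seq):
    """Parent index of every position in the min-rooted Cartesian tree.

    The root gets parent -1. Assumption: among equal values, the leftmost
    one is treated as smaller, so the leftmost minimum becomes the root.
    """
    parent = [-1] * len(seq)
    stack = []  # indices on the rightmost path, values non-decreasing
    for i, value in enumerate(seq):
        last_popped = -1
        while stack and seq[stack[-1]] > value:
            last_popped = stack.pop()
        if last_popped != -1:
            parent[last_popped] = i  # popped subtree becomes left child of i
        if stack:
            parent[i] = stack[-1]    # i becomes right child of the stack top
        stack.append(i)
    return parent


def exact_ct_matches(text, pattern):
    """Naive scan: start indices (0-based) of fragments CT-matching the pattern."""
    m = len(pattern)
    target = cartesian_tree_parents(pattern)
    return [
        start
        for start in range(len(text) - m + 1)
        if cartesian_tree_parents(text[start:start + m]) == target
    ]


if __name__ == "__main__":
    print(exact_ct_matches([5, 4, 6, 3, 7, 9, 8], [2, 1, 3]))
```

Comparing parent arrays is enough here because both trees are built over the same positions $0, \dots, m-1$ in the same in-order, so equal parent arrays mean equal tree shapes.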

Paper Structure

This paper contains 12 sections, 17 theorems, 5 equations, 3 figures.

Key Result

Theorem 1

The Approximate CT-Matching with Substitutions problem can be solved in $\mathcal{O}(n \sqrt{m} \cdot k^{2.5})$ time for $k \in [1, \lfloor m^{1/5}\rfloor]$ and in $\mathcal{O}(nk^5)$ time for any $k \in [\lfloor m^{1/5}\rfloor, m]$.
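
As a quick sanity check on the two regimes of Theorem 1, the following worked comparison uses only the bounds stated above and the $\mathcal{O}(nmk)$ bound of Kim and Han quoted in the abstract. It shows that the two bounds of Theorem 1 meet at $k = m^{1/5}$ and that the improvement over $\mathcal{O}(nmk)$ holds exactly for $k = o(m^{1/4})$.

$$
\begin{aligned}
k = m^{1/5}: \quad & n\sqrt{m}\cdot k^{2.5} = n \cdot m^{1/2} \cdot m^{1/2} = nm, \qquad nk^5 = nm,\\
n\sqrt{m}\cdot k^{2.5} \le nmk \iff\ & k^{1.5} \le m^{1/2} \iff k \le m^{1/3} \quad (\text{true for all } k \le m^{1/5}),\\
nk^{5} \le nmk \iff\ & k^{4} \le m \iff k \le m^{1/4}.
\end{aligned}
$$

Hence $nk^5 = o(nmk)$ precisely when $k = o(m^{1/4})$, which matches the regime given in the abstract.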

Figures (3)

  • Figure 1: The running time $t(n,k)$ of algorithms for Approximate CT-Matching with Substitutions as a function of $k$ for the special case when $m= \Theta(n)$, shown on a doubly logarithmic scale.
  • Figure 2: The Cartesian tree $\textsf{CT}(X)$ of $X = [4, 5, 6, 1, 2, 7, 7, 8, 3, 9]$ (top left), the tree induced by $\textnormal{psv}_X$ (top right), and the values of $\textnormal{psv}_X$, $\textnormal{nsv}_X$, $\textnormal{PD}(X)$, and $\textnormal{ND}(X)$ (bottom right). For $Y = [14, 15, 16, 11, 12, 17, 17, 18, 13, 19]$, we have $X \approx Y$, illustrating that different sequences can yield the same Cartesian tree. In contrast, the Cartesian tree of $Z = [14, 15, 16, 16, 12, 17, 17, 18, 8, 19]$ (bottom left) is different, i.e., we have $Z \not\approx X$. However, we have $\textnormal{CHd}(X \leadsto Z) = 2$ because we can substitute the values of $X$ at positions 4 and 9 to obtain $X' = [4, 5, 6, \mathbf{6}, 2, 7, 7, 8, \mathbf{1}, 9]\approx Z$.
  • Figure 3: An illustration of the settings of the two lemmas with source labels lem:no_sub_in_run and lem:trim, with $k=3$. The $\textnormal{psv}_Y$ array is shown with grey arrows below string $Y$. The lemma labeled lem:no_sub_in_run establishes that any sequence of $\textnormal{CHd}(X \leadsto Y)$ substitutions that transforms $X$ into $X'$ such that $X' \approx Y$ does not modify the fragment highlighted in red; the condition "$\forall i \in [c_r + p, \lvert Y\rvert] : \textnormal{psv}_Y(i) \notin [c_1, c_r)$" ensures that none of these substitutions "interact" with this fragment. The lemma labeled lem:trim establishes that we can trim a pair of fragments (one in $X$ and one in $Y$) without altering the value of $\textnormal{CHd}(X \leadsto Y)$.
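
The example in the caption of Figure 2 can be checked mechanically. The sketch below is a brute-force illustration, not the paper's algorithm, and all function names are hypothetical. It assumes a parent-distance encoding in the style of Park et al., where each position stores the distance to the nearest earlier position holding a value that is less than or equal to its own (0 if there is none), so ties favour the earlier position. It confirms that $X \approx Y$, that $X \not\approx Z$, that the two substitutions from the caption give $X' \approx Z$, and that no single substitution suffices, which together are consistent with $\textnormal{CHd}(X \leadsto Z) = 2$. Indices in the code are 0-based, so the caption's positions 4 and 9 are indices 3 and 8.

```python
def parent_distance(seq):
    """Distance to the nearest earlier position with value <= seq[i], else 0.

    Assumption: ties are resolved in favour of the earlier position.
    """
    distances, stack = [], []
    for i, value in enumerate(seq):
        while stack and seq[stack[-1]] > value:
            stack.pop()
        distances.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return distances


def ct_match(a, b):
    """True if a and b have the same Cartesian tree (same parent distances)."""
    return len(a) == len(b) and parent_distance(a) == parent_distance(b)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(u != v for u, v in zip(a, b))


def one_substitution_suffices(x, z):
    """Brute force: can changing a single value of x make it CT-match z?

    Only the relative order of the new value matters, so it is enough to try
    every existing value, every midpoint between consecutive distinct values,
    and one value below the minimum and one above the maximum.
    """
    vals = sorted(set(x))
    candidates = set(vals)
    candidates |= {(a + b) / 2 for a, b in zip(vals, vals[1:])}
    candidates |= {vals[0] - 1, vals[-1] + 1}
    for i in range(len(x)):
        for c in candidates:
            trial = list(x)
            trial[i] = c
            if ct_match(trial, z):
                return True
    return False


X = [4, 5, 6, 1, 2, 7, 7, 8, 3, 9]
Y = [14, 15, 16, 11, 12, 17, 17, 18, 13, 19]
Z = [14, 15, 16, 16, 12, 17, 17, 18, 8, 19]
X_PRIME = [4, 5, 6, 6, 2, 7, 7, 8, 1, 9]  # X with indices 3 and 8 substituted

if __name__ == "__main__":
    print("X ~ Y:", ct_match(X, Y))
    print("X ~ Z:", ct_match(X, Z))
    print("X' ~ Z:", ct_match(X_PRIME, Z))
    print("substitutions used:", hamming(X, X_PRIME))
    print("one substitution enough:", one_substitution_suffices(X, Z))
```

The single-substitution search is exhaustive only up to relative order, which is all that a Cartesian tree depends on; extending it to larger $k$ this way would be exponential and is meant purely as a check of the small example.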

Theorems & Definitions (25)

  • Theorem 1
  • Lemma 2: Periodicity Lemma (weak version) [Fine and Wilf, 1965]
  • Definition 3
  • Definition 4
  • Lemma 6 [exactCTmatching]
  • Lemma 7 [exactCTmatching]
  • Lemma 9 [exactCTmatching]
  • Definition 10 [Matsuoka et al., TCS 2016]
  • Lemma 11
  • Definition 12
  • ...and 15 more