Maximal number of subword occurrences in a word
Wenjie Fang
TL;DR
This work investigates the maximal density of subword occurrences in words by introducing subword entropy $S_{\mathrm{sw}}(w)=\log_2\operatorname{maxocc}(w)$ and studying the minimal entropy over words of length $n$ on an alphabet of size $k$. It proves that the limit $L_k=\lim_{n\to\infty}\min S_{\mathrm{sw}}^{(k)}(n)/n$ exists and is finite, derives foundational bounds, and reveals a richer structure through the binary case. The author derives sharper upper bounds for $L_2$ by analyzing periodic word families and computing explicit generating functions and saddle-point asymptotics for maximal subword occurrences, complemented by extensive experimental data. The work also establishes a general rationality result for $f_{w,v}(x,y)$, enabling a systematic analytic-combinatorics-based approach to asymptotics, and poses open questions and conjectures about the structure of minimizing words and the growth of $\min S_{\mathrm{sw}}^{(k)}(n)$. The findings advance understanding of pattern density in words and provide a framework for future precise characterizations of subword occurrence limits.
Abstract
We consider the number of occurrences of subwords (not necessarily consecutive subsequences) in a given word. We first define the notion of subword entropy of a given word, which measures the maximal number of occurrences among all possible subwords. We then give upper and lower bounds on the minimal subword entropy for words of fixed length over a fixed alphabet, and also show that the minimal subword entropy per letter has a limit value. A better upper bound on the minimal subword entropy for a binary alphabet is then given by looking at certain families of periodic words. We also give some conjectures based on experimental observations.
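
To make the definitions concrete, the following is a minimal brute-force sketch in Python of subword occurrence counting, the maximal number of occurrences, and subword entropy taken as the base-2 logarithm of that maximum. The function names (`occ`, `maxocc`, `subword_entropy`, `min_subword_entropy`) are hypothetical and chosen only for illustration, and the exhaustive search is an assumption made for clarity: it is exponential in the word length and is not presented as the method used in this work.

```python
from itertools import product
from math import log2


def occ(w, v):
    """Number of occurrences of v as a (not necessarily consecutive) subword of w."""
    # dp[j] = number of ways to embed the prefix v[:j] into the part of w read so far
    dp = [1] + [0] * len(v)
    for a in w:
        # Scan j from right to left so that the current letter of w
        # extends each partial embedding at most once.
        for j in range(len(v), 0, -1):
            if v[j - 1] == a:
                dp[j] += dp[j - 1]
    return dp[len(v)]


def maxocc(w):
    """Maximal number of occurrences in w over all candidate subwords (brute force)."""
    alphabet = sorted(set(w))
    best = 1  # the empty subword occurs exactly once
    for m in range(1, len(w) + 1):
        for v in product(alphabet, repeat=m):
            best = max(best, occ(w, v))
    return best


def subword_entropy(w):
    """Subword entropy of w: log base 2 of maxocc(w)."""
    return log2(maxocc(w))


def min_subword_entropy(n, k):
    """Minimal subword entropy over all words of length n on k letters (brute force)."""
    return min(subword_entropy(w) for w in product(range(k), repeat=n))
```

For instance, `occ("abab", "ab")` counts the three embeddings of `ab` in `abab`, and `min_subword_entropy(3, 2)` evaluates to `1.0`, since every binary word of length 3 repeats a letter and none has a subword occurring more than twice. Only very small values of `n` are feasible with this sketch.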
