Linguistic Predictability and Search Complexity: How Linguistic Redundancy Constraints the Landscape of Classical and Quantum Search

Alessio Di Santo, Gabriella Lanziani

TL;DR

The study addresses how linguistic regularities shape the computational difficulty of decrypting monoalphabetic substitution ciphers. It uses four Renaissance Italian corpora to build character $n$-gram distributions and a corpus-driven metric $p_{\text{good}}$, the probability that a key produces a linguistically plausible decryption. It combines classical optimization (hill climbing, simulated annealing), Grover-style quantum-inspired estimates, and a QUBO annealing framework within a unified scoring function $S(\pi)=\log_{10} P_{\text{LM}}(\pi(c_{1:L}))$, linking language redundancy to search-space contraction. Key findings: longer texts produce more concentrated score distributions, which drives $p_{\text{good}}$ down, and the estimated oracle calls follow the $N_{\text{oracle}} \sim 1/\sqrt{p_{\text{good}}}$ scaling across corpora. Classical and quantum-inspired methods converge on similar high-scoring regions, underscoring the impact of linguistic structure on search. The framework offers a practical, corpus-driven benchmark for comparing classical and quantum search dynamics and suggests broad applicability to cross-linguistic studies of computational complexity in language tasks.
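To make the scoring function $S(\pi)$ concrete, here is a minimal Python sketch of a corpus-driven score for a substitution key. It is an illustration under stated assumptions, not the authors' implementation: the 26-letter `ALPHABET`, the add-one smoothing, the use of unconditional trigram frequencies as a stand-in for $P_{\text{LM}}$, and every function name are hypothetical choices made for this example.

```python
import math
from collections import Counter

# Assumption: plain 26-letter alphabet, not the paper's 25-letter orthography.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train_trigram_model(corpus: str) -> dict:
    """Count character trigrams, keeping only letters found in ALPHABET."""
    text = "".join(ch for ch in corpus.lower() if ch in ALPHABET)
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {"counts": counts, "total": sum(counts.values())}


def log10_prob(text: str, model: dict) -> float:
    """Sum of log10 add-one-smoothed trigram frequencies (toy stand-in for P_LM)."""
    vocab = len(ALPHABET) ** 3          # number of possible trigrams
    denom = model["total"] + vocab      # add-one smoothing denominator
    return sum(
        math.log10((model["counts"][text[i:i + 3]] + 1) / denom)
        for i in range(len(text) - 2)
    )


def apply_key(ciphertext: str, key: str) -> str:
    """Decrypt with a substitution key: cipher letter ALPHABET[i] maps to key[i]."""
    return ciphertext.translate(str.maketrans(ALPHABET, key))


def score(ciphertext: str, key: str, model: dict) -> float:
    """S(pi) = log10 P_LM(pi(c_{1:L})) under the toy trigram model."""
    return log10_prob(apply_key(ciphertext, key), model)
```

A hill climber or simulated annealer would call `score` repeatedly while swapping pairs of letters in `key`; keys that yield language-like text score higher than keys that yield noise.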

Abstract

This study examines the quantitative relationship between linguistic regularities and computational search complexity through a hybrid classical-quantum framework applied to Renaissance Italian texts. Using four representative works from the fifteenth and sixteenth centuries, namely Il Principe (Machiavelli), Il Cortegiano (Castiglione), I Ricordi (Guicciardini), and Orlando Furioso (Ariosto), we construct character-based $n$-gram models under both a historically grounded 25-letter orthography and the full modern Italian alphabet. These models provide corpus-derived probabilistic baselines for evaluating substitution-cipher search processes. Combining classical hill climbing and simulated annealing with Grover-style quantum-inspired estimates and a QUBO annealing formulation, we quantify how the probability that a key produces a linguistically plausible decryption ($p_{\text{good}}$) relates to expected computational effort. Across cipher lengths from 200 to 1000 characters, empirical results confirm the predicted dependence of Grover oracle calls on $1/\sqrt{p_{\text{good}}}$ and show that longer texts yield sharper score distributions and smaller feasible key regions. Overall, the findings establish a link between linguistic redundancy and search-space contraction, providing an empirical framework for comparing classical, quantum-inspired, and idealized quantum search dynamics under unified corpus-driven constraints.
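The relation between $p_{\text{good}}$ and search effort can be sketched numerically. The Python example below estimates $p_{\text{good}}$ as the fraction of sampled key scores at or above a threshold $\tau$, then converts it into an oracle-call estimate. The $\pi/4$ prefactor is the textbook Grover constant and is an assumption here, since the source states only the $1/\sqrt{p_{\text{good}}}$ proportionality; the function names and the classical random-sampling baseline are likewise hypothetical additions for comparison.

```python
import math
from typing import Sequence


def estimate_p_good(scores: Sequence[float], tau: float) -> float:
    """Fraction of sampled keys whose score is at least the threshold tau."""
    if not scores:
        raise ValueError("need at least one sampled score")
    return sum(s >= tau for s in scores) / len(scores)


def grover_oracle_calls(p_good: float) -> float:
    """Textbook Grover estimate: about (pi/4) / sqrt(p_good) oracle calls."""
    if not 0.0 < p_good <= 1.0:
        raise ValueError("p_good must lie in (0, 1]")
    return (math.pi / 4.0) / math.sqrt(p_good)


def classical_expected_samples(p_good: float) -> float:
    """Expected uniform random draws until the first good key (geometric mean)."""
    if not 0.0 < p_good <= 1.0:
        raise ValueError("p_good must lie in (0, 1]")
    return 1.0 / p_good
```

Under these assumptions, shrinking $p_{\text{good}}$ by a factor of 100 raises the Grover estimate by a factor of 10 and the random-sampling baseline by a factor of 100, which is the quadratic gap the $1/\sqrt{p_{\text{good}}}$ scaling expresses.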

Paper Structure

This paper contains 25 sections, 6 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Normalization and modeling pipeline.
  • Figure 2: Schematic representation of the key search space.
  • Figure 3: Distributions of normalized trigram scores for randomly sampled keys at five cipher lengths.
  • Figure 4: Estimated fraction of good keys $p_{\text{good}}$ as a function of the score threshold $\tau$.
  • Figure 5: Estimated Grover oracle calls as a function of the threshold $\tau$.
  • ...and 4 more figures