On the Statistical Query Complexity of Learning Semiautomata: a Random Walk Approach

George Giapitzakis, Kimon Fountoulakis, Eshaan Nichani, Jason D. Lee

TL;DR

The paper investigates the hardness of learning semiautomata with $N$ states under the uniform distribution over input words and initial states within the Statistical Query (SQ) framework. By mapping semiautomata to random walks on the product group $S_N \times S_N$ and employing Fourier analysis on the symmetric group, the authors identify a key irreducible representation that governs indistinguishability, and prove tight spectral-gap bounds showing mixing after $T = \Omega(N^2 \log N)$ steps. They construct a randomized $(k, M)$-shuffle hard set with $M = N!$ and alphabet size $|\Sigma| = \Omega(N^3 \log N)$, achieving a final-state agreement probability $P_{\mathrm{agree}}(T) = \frac{1}{N} + \frac{1}{N}\boldsymbol{v}^\top M_{\Pi_0}^T \boldsymbol{v}$ that becomes exponentially close to $1/N$; this yields a statistical dimension of $N!$ and thus SQ hardness. Consequently, any SQ learner under the uniform distribution must either make super-polynomial queries or use super-polynomially small tolerance, highlighting a fundamental structural barrier to learning semiautomata. The work connects automata learning to group representation theory, providing precise mixing-time and spectral-gap characterizations with potential implications for understanding the limits of noise-tolerant learning and the role of internal transition structure in learnability.
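To see why $1/N$ is the natural baseline for the agreement probability, consider an idealized case, assumed here purely for illustration, in which the pair of final states $(q_T, q_T')$ of the two semiautomata is exactly uniform on $\mathcal{Q} \times \mathcal{Q}$. The symbols $q_T$ and $q_T'$ are notation introduced for this sketch and may differ from the paper's.

```latex
\Pr[q_T = q_T']
  = \sum_{q \in \mathcal{Q}} \Pr[q_T = q,\; q_T' = q]
  = N \cdot \frac{1}{N^2}
  = \frac{1}{N}.
```

Read this way, the term $\frac{1}{N}\boldsymbol{v}^\top M_{\Pi_0}^T \boldsymbol{v}$ measures how far the walk after $T$ steps still is from that idealized uniform case.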

Abstract

Semiautomata form a rich class of sequence-processing algorithms with applications in natural language processing, robotics, computational biology, and data mining. We establish the first Statistical Query hardness result for semiautomata under the uniform distribution over input words and initial states. We show that Statistical Query hardness can be established when both the alphabet size and input length are polynomial in the number of states. Unlike the case of deterministic finite automata, where hardness typically arises through the hardness of the language they recognize (e.g., parity), our result is derived solely from the internal state-transition structure of semiautomata. Our analysis reduces the task of distinguishing the final states of two semiautomata to studying the behavior of a random walk on the group $S_{N} \times S_{N}$. By applying tools from Fourier analysis and the representation theory of the symmetric group, we obtain tight spectral gap bounds, demonstrating that after a polynomial number of steps in the number of states, distinct semiautomata become nearly uncorrelated, yielding the desired hardness result.
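As a rough illustration of the reduction described above, the following sketch tracks the joint distribution of the states of two semiautomata that read the same uniformly random word from the same uniformly random initial state, and returns the exact probability that their final states agree. It assumes each letter acts as a permutation of the states, drawn uniformly at random here. This random model and the names `random_semiautomaton` and `agreement_probability` are illustrative assumptions, not the paper's hard-family construction.

```python
import random


def random_semiautomaton(n_states, alphabet_size, rng):
    """One permutation of the states per letter (illustrative model, not the paper's)."""
    transitions = []
    for _ in range(alphabet_size):
        perm = list(range(n_states))
        rng.shuffle(perm)
        transitions.append(perm)
    return transitions


def agreement_probability(trans_a, trans_b, n_states, length):
    """Exact probability that both machines end in the same state.

    The word is uniform over all words of the given length and the shared
    initial state is uniform over the states.
    """
    n = n_states
    # joint[q][r] = probability that machine A is in state q and machine B in state r.
    joint = [[0.0] * n for _ in range(n)]
    for q in range(n):
        joint[q][q] = 1.0 / n  # both start in the same uniformly random state
    weight = 1.0 / len(trans_a)  # each letter is equally likely
    for _ in range(length):
        nxt = [[0.0] * n for _ in range(n)]
        for perm_a, perm_b in zip(trans_a, trans_b):
            for q in range(n):
                for r in range(n):
                    nxt[perm_a[q]][perm_b[r]] += weight * joint[q][r]
        joint = nxt
    # Agreement means the joint state lies on the diagonal.
    return sum(joint[q][q] for q in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    a = random_semiautomaton(5, 20, rng)
    b = random_semiautomaton(5, 20, rng)
    print(agreement_probability(a, b, 5, 50))  # expected to be close to 1/5
    print(agreement_probability(a, a, 5, 50))  # identical machines always agree
```

For small instances such as $N = 5$ with 20 letters and $T = 50$, two independently drawn semiautomata should give a value very close to $1/N$, while a semiautomaton compared with itself always agrees.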

Paper Structure

This paper contains 29 sections, 26 theorems, 108 equations.

Key Result

Theorem 4.1

Consider the random walk corresponding to two semiautomata $\mathcal{A}$ and $\mathcal{A}'$ operating on the same alphabet $\Sigma$ and state-space $\mathcal{Q}$, as described in sec:two_dfa_walk. Let $N = |\mathcal{Q}|$ and let $T \in \mathbb{N}$ be the input length. Let $X_0 \sim \mathcal{U}(\ldots)$ [...] Then [...] where $M_{\Pi_0}$ is the Fourier transform of the single-step distribution $T_{\ldots}$ [...]

Theorems & Definitions (71)

  • Definition 3.1: Statistical Query oracle
  • Definition 3.2: Statistical query hardness
  • Definition 4.1: Single-step probability
  • Theorem 4.1: Agreement probability
  • proof : Proof outline
  • Definition 5.1: The randomized $(k, M)$-shuffle family construction
  • Lemma 5.1
  • proof : Proof outline
  • Lemma 5.2
  • Theorem 5.1
  • ...and 61 more
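
Definition 3.1 itself is not reproduced in this listing. As a minimal sketch of the standard statistical query model (assumed here, and possibly differing in details from the paper's definition), an oracle receives a bounded query function and a tolerance, and may return any value within that tolerance of the query's expectation. The function name `sq_oracle` and the toy data are hypothetical.

```python
import random


def sq_oracle(query, examples, tolerance, rng):
    """Return a value within `tolerance` of the mean of `query` over `examples`.

    `examples` is a finite list of equally likely (input, label) pairs, so the
    expectation can be computed exactly. The random perturbation stands in for
    the oracle's freedom to answer anywhere inside the tolerance window.
    """
    exact = sum(query(x, y) for x, y in examples) / len(examples)
    return exact + rng.uniform(-tolerance, tolerance)


if __name__ == "__main__":
    rng = random.Random(0)
    examples = [(x, x % 2) for x in range(8)]  # toy labelled data
    correlation = lambda x, y: 1.0 if y == 1 else -1.0  # query with values in [-1, 1]
    print(sq_oracle(correlation, examples, 0.1, rng))  # some value in [-0.1, 0.1]
```

A learner in this model never sees individual examples, only such noisy expectations, which is why hardness is stated as a trade-off between the number of queries and the tolerance.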