On the Statistical Query Complexity of Learning Semiautomata: a Random Walk Approach
George Giapitzakis, Kimon Fountoulakis, Eshaan Nichani, Jason D. Lee
TL;DR
The paper investigates the hardness of learning semiautomata with $N$ states under the uniform distribution over input words and initial states within the Statistical Query (SQ) framework. By mapping semiautomata to random walks on the product group $S_N \times S_N$ and employing Fourier analysis on the symmetric group, the authors identify a key irreducible representation that governs indistinguishability, and prove tight spectral-gap bounds showing mixing after $T = \Omega(N^2 \log N)$ steps. They construct a randomized $(k,M)$-shuffle hard set with $M = N!$ and alphabet size $|\Sigma| = \Omega(N^3 \log N)$, achieving a final-state agreement probability $P_{\mathrm{agree}}(T) = \frac{1}{N} + \frac{1}{N}\boldsymbol{v}^\top M_{\Pi_0}^T \boldsymbol{v}$ that becomes exponentially close to $1/N$; this yields a statistical dimension of $N!$ and thus SQ hardness. Consequently, any SQ learner under the uniform distribution must either make super-polynomially many queries or use a super-polynomially small tolerance, highlighting a fundamental structural barrier to learning semiautomata. The work connects automata learning to group representation theory, providing precise mixing-time and spectral-gap characterizations with potential implications for understanding the limits of noise-tolerant learning and the role of internal transition structure in learnability.
Abstract
Semiautomata form a rich class of sequence-processing algorithms with applications in natural language processing, robotics, computational biology, and data mining. We establish the first Statistical Query hardness result for semiautomata under the uniform distribution over input words and initial states. We show that Statistical Query hardness can be established when both the alphabet size and input length are polynomial in the number of states. Unlike the case of deterministic finite automata, where hardness typically arises through the hardness of the language they recognize (e.g., parity), our result is derived solely from the internal state-transition structure of semiautomata. Our analysis reduces the task of distinguishing the final states of two semiautomata to studying the behavior of a random walk on the group $S_{N} \times S_{N}$. By applying tools from Fourier analysis and the representation theory of the symmetric group, we obtain tight spectral gap bounds, demonstrating that after a number of steps polynomial in the number of states, distinct semiautomata become nearly uncorrelated, yielding the desired hardness result.
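The reduction to a random walk on pairs of permutations can be illustrated numerically. The sketch below is not the paper's hard-set construction: it assumes, purely for illustration, that every letter acts on the states as an independent uniformly random permutation, and the names `random_semiautomaton` and `agreement_probabilities` are hypothetical. For two such semiautomata it tracks the exact joint distribution of their current states under a uniformly random word and a shared uniform initial state, and reports the probability that the two final states agree. The parameters are toy values, far smaller than the regime the result concerns.

```python
import random


def random_semiautomaton(num_states, alphabet_size, rng):
    """One permutation of the states per letter (illustrative assumption)."""
    automaton = []
    for _ in range(alphabet_size):
        perm = list(range(num_states))
        rng.shuffle(perm)
        automaton.append(perm)
    return automaton


def agreement_probabilities(aut_a, aut_b, max_len):
    """Exact P[final states agree] for every word length 0..max_len.

    Words are uniform over the alphabet and both semiautomata start
    from the same uniformly random state.
    """
    n = len(aut_a[0])
    k = len(aut_a)
    # dist[i][j] = P[A is in state i and B is in state j]
    dist = [[(1.0 / n if i == j else 0.0) for j in range(n)] for i in range(n)]
    probs = [sum(dist[i][i] for i in range(n))]
    for _ in range(max_len):
        nxt = [[0.0] * n for _ in range(n)]
        # Reading a uniform letter moves the pair (i, j) to (a(i), b(j)).
        for perm_a, perm_b in zip(aut_a, aut_b):
            for i in range(n):
                row = dist[i]
                target = nxt[perm_a[i]]
                for j in range(n):
                    target[perm_b[j]] += row[j] / k
        dist = nxt
        probs.append(sum(dist[i][i] for i in range(n)))
    return probs


rng = random.Random(0)
N, K, T = 5, 40, 200  # toy sizes: states, letters, word length
a = random_semiautomaton(N, K, rng)
b = random_semiautomaton(N, K, rng)
probs = agreement_probabilities(a, b, T)
print("T=0:", probs[0], " T=%d:" % T, probs[T], " 1/N:", 1.0 / N)
```

Under these assumptions the agreement probability starts at 1 for the empty word and should settle near $1/N$ as the word grows, which is the qualitative behavior behind the statement that distinct semiautomata become nearly uncorrelated. The speed of that decay is what the spectral gap bounds quantify.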
