Classifying Words with 3-sort Automata
Tomasz Jastrząb, Frédéric Lardeux, Eric Monfroy
TL;DR
This work addresses learning probabilistic automata from positive and negative word samples, using finite automata with three sorts of states: accepting, rejecting, and inconclusive. It proposes a three-step pipeline: infer a 3-sort NFA, convert it to a 3-sort weighted-frequency automaton, and derive from that a 3-sort probabilistic automaton that classifies words using several path- and weight-based classifiers. The approach includes a size-reduction technique to keep the automata compact, and it defines counting functions and normalization formulas so that the resulting weights are valid probabilities. An empirical evaluation on real peptide data and on regex-generated languages shows that probabilistic NFAs can be applied to classification, with prefix-based models and certain classifiers yielding strong results. Future work includes weight tuning and ensemble strategies to improve robustness and applicability.
Abstract
Grammatical inference consists of learning a language or a grammar from data. In this paper, we consider a number of models for inferring a non-deterministic finite automaton (NFA) with 3 sorts of states that must accept some words of a given sample and reject the others. We then propose a transformation from this 3-sort NFA into weighted-frequency and probabilistic NFAs, and we apply the latter to a classification task. The experimental evaluation of our approach shows that probabilistic NFAs can be successfully applied to classification tasks on both real-life and artificial benchmark data sets.
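To make the pipeline concrete, the following is a minimal sketch of its last two steps on a toy automaton. It is an illustration under stated assumptions, not the paper's definitions: the automaton (`SORT`, `DELTA`), the counting rule in `count_transitions`, the per-state normalization in `normalize`, and the decision rule in `classify` are all hypothetical choices made here for brevity.

```python
from collections import defaultdict

# Hypothetical toy 3-sort NFA over {a, b}; not taken from the paper.
# Each state has a sort: "acc" (accepting), "rej" (rejecting), "inc" (inconclusive).
SORT = {0: "inc", 1: "acc", 2: "rej"}
INITIAL = 0
# DELTA[(state, symbol)] is the set of successor states.
DELTA = {
    (0, "a"): {1}, (0, "b"): {2},
    (1, "a"): {1}, (1, "b"): {0},
    (2, "a"): {0, 2}, (2, "b"): {2},
}

def count_transitions(words):
    """Weighted-frequency step (assumed rule): count how many partial
    paths of the sample words traverse each transition."""
    counts = defaultdict(int)
    for word in words:
        paths = {INITIAL: 1}  # state -> number of paths reaching it
        for symbol in word:
            nxt = defaultdict(int)
            for q, n in paths.items():
                for r in DELTA.get((q, symbol), ()):
                    counts[(q, symbol, r)] += n
                    nxt[r] += n
            paths = nxt
    return counts

def normalize(counts):
    """Probabilistic step (assumed rule): divide each count by the total
    count of transitions leaving the same source state."""
    totals = defaultdict(int)
    for (q, _, _), c in counts.items():
        totals[q] += c
    return {t: c / totals[t[0]] for t, c in counts.items()}

def classify(probs, word):
    """Compare the probability mass ending in accepting vs rejecting states."""
    mass = {INITIAL: 1.0}
    for symbol in word:
        nxt = defaultdict(float)
        for (q, a, r), p in probs.items():
            if a == symbol and q in mass:
                nxt[r] += mass[q] * p
        mass = nxt
    acc = sum(m for q, m in mass.items() if SORT[q] == "acc")
    rej = sum(m for q, m in mass.items() if SORT[q] == "rej")
    if acc > rej:
        return "accept"
    if rej > acc:
        return "reject"
    return "inconclusive"

probs = normalize(count_transitions(["a", "aa", "b", "ba"]))
print(classify(probs, "a"), classify(probs, "ba"), classify(probs, "ab"))
```

In this sketch, a transition that no sample word ever uses gets no probability, so a word whose paths all rely on such transitions ends with zero mass and is reported as inconclusive.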
