
Classifying Words with 3-sort Automata

Tomasz Jastrząb, Frédéric Lardeux, Eric Monfroy

TL;DR

This work addresses learning probabilistic automata from positive and negative word samples by leveraging finite automata with three state sorts: accepting, rejecting, and inconclusive. It develops a rigorous pipeline: infer a $3$-sort NFA, convert it to a $3$-sort weighted-frequency automaton, and finally derive a $3$-sort probabilistic automaton to classify words, with multiple path- and weight-based classifiers. The approach includes a size-reduction technique to keep automata compact and defines counting functions and normalization formulas to produce valid probabilities. Empirical evaluation on real peptide data and regex-generated languages demonstrates that probabilistic NFAs can achieve meaningful classification performance, with prefix-based models and certain classifiers yielding strong results; future work includes weight-tuning and ensemble strategies to enhance robustness and applicability.

Abstract

Grammatical inference consists in learning a language or a grammar from data. In this paper, we consider a number of models for inferring a non-deterministic finite automaton (NFA) with 3 sorts of states that must accept some words and reject others from a given sample. We then propose a transformation from this 3-sort NFA into weighted-frequency and probabilistic NFAs, and we apply the latter to a classification task. The experimental evaluation of our approach shows that probabilistic NFAs can be successfully applied to classification tasks on both real-life and synthetic benchmark data sets.
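
To make the pipeline summarized above more concrete, the following is a minimal sketch in Python. It is not the paper's construction: the toy automaton, the counting rule (one count per transition per run on a sample word), the per-state normalization, and the decision rule are all assumptions chosen for illustration, and every name in it (`SORTS`, `DELTA`, `paths`, `frequency_weights`, `normalize`, `classify`) is hypothetical.

```python
from collections import defaultdict

# Hypothetical toy 3-sort NFA: each state has exactly one sort.
SORTS = {0: "inconclusive", 1: "accepting", 2: "rejecting"}
INITIAL = 0
# (state, symbol) -> set of successor states
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {2},
    (1, "a"): {1},
    (1, "b"): {2},
    (2, "b"): {2},
}


def paths(word, state=INITIAL):
    """Yield every complete transition sequence the NFA can follow on `word`."""
    if not word:
        yield []
        return
    for nxt in sorted(DELTA.get((state, word[0]), ())):
        for rest in paths(word[1:], nxt):
            yield [(state, word[0], nxt)] + rest


def frequency_weights(sample):
    """Assumed counting rule: +1 per transition for every run on a sample word."""
    counts = defaultdict(int)
    for word in sample:
        for path in paths(word):
            for transition in path:
                counts[transition] += 1
    return counts


def normalize(counts):
    """Assumed normalization: weights leaving each state sum to 1."""
    totals = defaultdict(int)
    for (src, _symbol, _dst), c in counts.items():
        totals[src] += c
    return {t: c / totals[t[0]] for t, c in counts.items()}


def classify(word, probs):
    """Assumed decision rule: compare summed path weights by final-state sort."""
    score = {"accepting": 0.0, "rejecting": 0.0, "inconclusive": 0.0}
    for path in paths(word):
        weight = 1.0
        for transition in path:
            weight *= probs.get(transition, 0.0)
        end = path[-1][2] if path else INITIAL
        score[SORTS[end]] += weight
    if score["accepting"] == score["rejecting"]:
        return "inconclusive"
    return max(("accepting", "rejecting"), key=score.get)


probs = normalize(frequency_weights(["a", "aa", "ab", "b"]))
print(classify("aa", probs), classify("ab", probs), classify("ba", probs))
```

In this sketch a word with no complete run, or with equal accepting and rejecting mass, is reported as inconclusive. Other choices, such as taking the maximum path weight instead of the sum, would give different classifiers.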

Paper Structure

This paper contains 17 sections, 6 equations, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Best accuracy and F1-score metrics obtained by all NFAs for the analyzed benchmark sets and classifiers $\mathcal{C}_{MM}$ (black), $\mathcal{C}_{MA}$ (gray), $\mathcal{C}_{SM}$ (light gray), and $\mathcal{C}_{SA}$.

Theorems & Definitions (3)

  • Definition 1
  • Definition 2: 3_NWFFA
  • Definition 3: 3_NPFA