Classifying Words with 3-sort Automata

Tomasz Jastrząb; Frédéric Lardeux; Eric Monfroy

TL;DR

This work addresses learning probabilistic automata from positive and negative word samples by using finite automata with three sorts of states: accepting, rejecting, and inconclusive. It develops a three-step pipeline: infer a 3-sort NFA, convert it to a 3-sort weighted-frequency automaton, and finally derive a 3-sort probabilistic automaton that classifies words using several path- and weight-based classifiers. The approach includes a size-reduction technique to keep automata compact, and it defines counting functions and normalization formulas to produce valid probabilities. Empirical evaluation on real peptide data and regex-generated languages shows that probabilistic NFAs can achieve meaningful classification performance, with prefix-based models and certain classifiers yielding strong results. Future work includes weight tuning and ensemble strategies to improve robustness and applicability.

Abstract

Grammatical inference consists of learning a language or a grammar from data. In this paper, we consider a number of models for inferring a non-deterministic finite automaton (NFA) with 3 sorts of states that must accept some words and reject others from a given sample. We then propose a transformation from this 3-sort NFA into weighted-frequency and probabilistic NFAs, and we apply the latter to a classification task. The experimental evaluation of our approach shows that probabilistic NFAs can be successfully applied to classification tasks on both real-life and artificial benchmark data sets.

Paper Structure (17 sections, 6 equations, 1 figure, 5 tables)

Introduction
The NFA inference problem: first models
Notations
Core of the models
Building paths
The models
From $\mathcal{O}(k^3)$ to $\mathcal{O}((k+2)^2)$
From 3NFA to weighted-frequency NFA and probabilistic NFA
Weighted-frequency automata
From 3-sort NFA to weighted-frequency automata
Probabilistic automata
From weighted-frequency to probabilistic automata
Classifying words
Experimentation
Experiment I
...and 2 more sections

Figures (1)

Figure 1: Best accuracy and F1-score metrics obtained by all NFAs for the analyzed benchmark sets and classifiers $\mathcal{C}_{MM}$ (black), $\mathcal{C}_{MA}$ (gray), $\mathcal{C}_{SM}$ (light gray), and $\mathcal{C}_{SA}$.

Theorems & Definitions (3)

Definition 1
Definition 2: 3_NWFFA
Definition 3: 3_NPFA
