Learning DFAs from Positive Examples Only via Word Counting

Benjamin Bordais, Daniel Neider

TL;DR

This paper investigates learning DFAs from positive examples only by introducing a word-counting criterion: among DFAs with at most n states that accept all positive examples, select one that minimizes the number of accepted words of length up to 2n−2. It proves that the minimal-word-count problem is NP-complete, links this criterion to the existing language-minimality goal, and enables binary-search strategies over SAT formulations. An ILP encoding is developed to solve the task; empirically it lags behind state-of-the-art symbolic methods, but a novel preprocessing heuristic based on transition-system scoring shows promise as a practical accelerator. Overall, the work provides both a theoretical foundation and a pragmatic direction for improving DFA learning from positive examples, highlighting the value of count-based metrics alongside traditional language-based criteria.

Abstract

Learning finite automata from positive examples has recently gained attention as a powerful approach for understanding, explaining, analyzing, and verifying black-box systems. The motivation for focusing solely on positive examples arises from the practical limitation that we can only observe what a system is capable of (positive examples) but not what it cannot do (negative examples). Unlike the classical problem of passive DFA learning with both positive and negative examples, which has been known to be NP-complete since the 1970s, the topic of learning DFAs exclusively from positive examples remains poorly understood. This paper introduces a novel perspective on this problem by leveraging the concept of counting the number of accepted words up to a carefully determined length. Our contributions are twofold. First, we prove that computing the minimal number of words up to this length accepted by DFAs of a given size that accept all positive examples is NP-complete, establishing that learning from positive examples alone is computationally demanding. Second, we propose a new learning algorithm with a better asymptotic runtime than the best-known bound for existing algorithms. While our experimental evaluation reveals that this algorithm underperforms state-of-the-art methods, it demonstrates significant potential as a preprocessing step to enhance existing approaches.

Paper Structure

This paper contains 31 sections, 12 theorems, 75 equations, 5 figures, 1 algorithm.

Key Result

Proposition 4

For an alphabet $\Sigma$, a set $\mathcal{P} \subseteq \Sigma^*$, $n \geq 1$, and $h := 2n-2$, a DFA $\mathcal{A} \in \mathsf{Rec}(\mathcal{P},n)$ that is $\mathsf{Rec}(\mathcal{P},n)$-minimal w.r.t. $\prec_h$ is $\mathsf{Rec}(\mathcal{P},n)$-minimal w.r.t. $\prec_\mathcal{L}$.
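The quantity behind the word-counting criterion, the number of accepted words of length at most $h = 2n-2$, can be computed in polynomial time for a fixed DFA by dynamic programming over word lengths. The sketch below is only an illustration and is not taken from the paper: the function name `count_accepted_words` and the dictionary-based DFA representation are assumptions made for this example, and the DFA is assumed to be complete (every state has a transition for every letter).

```python
from collections import defaultdict


def count_accepted_words(alphabet, delta, initial, accepting, h):
    """Count the words of length <= h accepted by a complete DFA.

    Hypothetical representation (not from the paper):
      alphabet  - iterable of letters
      delta     - dict mapping (state, letter) -> state
      initial   - initial state
      accepting - set of accepting states
    """
    # paths[q] = number of words of the current length leading from initial to q
    paths = {initial: 1}
    total = 0
    for _ in range(h + 1):  # word lengths 0, 1, ..., h
        # Every word ending in an accepting state is accepted.
        total += sum(c for q, c in paths.items() if q in accepting)
        # Extend every word by one letter.
        nxt = defaultdict(int)
        for q, c in paths.items():
            for a in alphabet:
                nxt[delta[(q, a)]] += c
        paths = nxt
    return total


# Example: a 2-state DFA over {a, b} accepting words with an even number of a's.
n = 2
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
count = count_accepted_words("ab", delta, initial=0, accepting={0}, h=2 * n - 2)
print(count)
```

For this example, the accepted words of length at most 2 are the empty word, `b`, `aa`, and `bb`. Comparing such counts for two candidate DFAs with at most $n$ states is one way to read the comparison that the TL;DR describes; the precise definition of $\prec_h$ is the one given in the paper.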

Figures (5)

  • Figure 1: The shape of a DFA $\mathcal{A}_{\mathsf{ex}}$.
  • Figure 2: Runtime comparison of the symbolic and ILP algorithms.
  • Figure 3: The runtime comparison of the symbolic and the heuristic+symbolic$^*$ algorithms.
  • Figure 4: The quotient of the runtime of the symbolic(*) algorithms as a function of the starting score.
  • Figure 5: A DFA $\mathcal{A}(X,\mathcal{C},k,d,T,M,\nu)$.

Theorems & Definitions (36)

  • Definition 1
  • Definition 3
  • Proposition 4
  • Theorem 5: Folk result
  • proof
  • Lemma 8
  • Theorem 9
  • proof
  • Proposition 10
  • proof
  • ...and 26 more