The categorical contours of the Chomsky-Schützenberger representation theorem
Paul-André Melliès, Noam Zeilberger
TL;DR
This work recasts context-free grammars and nondeterministic finite-state automata within a unified categorical framework, treating grammars as functors from free operads and automata as finitary ULF functors, thereby enabling CFGs and NFAs to be studied over arbitrary categories and operads. The key innovations are the operad of spliced arrows, the contour category with its adjunction to splicing, and the universal tree contour language, which together yield a generalized Chomsky–Schützenberger representation: a language is a CFL of arrows iff it is a functorial image of the intersection of a tree contour language with a regular language. The authors develop closure properties, a fibrational parsing view, and a pullback construction that jointly handle the intersection of CFLs with regular languages, and they extend the framework to generalized CFGs over arbitrary operads, including multiple and parallel CFGs. Collectively, the results provide a robust categorical foundation for formal languages, automata, and parsing, with potential implications for type systems, LR parsing, and cross-domain applications in category theory and computer science.
Abstract
We develop fibrational perspectives on context-free grammars and on nondeterministic finite-state automata over categories and operads. A generalized CFG is a functor from a free colored operad (aka multicategory) generated by a pointed finite species into an arbitrary base operad: this encompasses classical CFGs by taking the base to be a certain operad constructed from a free monoid, as an instance of a more general construction of an \emph{operad of spliced arrows} $\mathcal{W}\,\mathcal{C}$ for any category $\mathcal{C}$. A generalized NFA is a functor from an arbitrary bipointed category or pointed operad satisfying the unique lifting of factorizations and finite fiber properties: this encompasses classical word automata and tree automata without $ε$-transitions, but also automata over non-free categories and operads. We show that generalized context-free and regular languages satisfy suitable generalizations of many of the usual closure properties, and in particular we give a simple conceptual proof that context-free languages are closed under intersection with regular languages. Finally, we observe that the splicing functor $\mathcal{W} : Cat \to Oper$ admits a left adjoint $\mathcal{C}: Oper \to Cat$, which we call the \emph{contour category} construction since the arrows of $\mathcal{C}\,\mathcal{O}$ have a geometric interpretation as oriented contours of operations of $\mathcal{O}$. A direct consequence of the contour / splicing adjunction is that every pointed finite species induces a universal CFG generating a language of \emph{tree contour words.} This leads us to a generalization of the Chomsky-Schützenberger Representation Theorem, establishing that a subset of a homset $L \subseteq \mathcal{C}(A,B)$ is a CFL of arrows if and only if it is a functorial image of the intersection of a $\mathcal{C}$-chromatic tree contour language with a regular language.
