Fast and General Automatic Differentiation for Finite-State Methods

Lucas Ondel Yang; Tina Raissi; Martin Kocour; Pablo Riera; Caio Corro

TL;DR

The paper addresses the cost of automatic differentiation (AD) for semiring-based dynamic programming in structured prediction. Its "morphism-trick" applies when the semiring's additive monoid $(S, \oplus, \bar{0})$ is isomorphic to $(\mathbb{R}, +, 0)$. In that case the backward pass reduces to a custom, semiring-agnostic vector-Jacobian product, which makes gradients of large finite-state computations memory-efficient. The authors give a general dynamic-programming formulation for the weight $\nu(\mathcal{A})$ of a weighted finite-state automaton (WFSA), derive a morphism-based differentiation strategy for it, report orders-of-magnitude speedups over standard AD approaches, and release an open-source implementation, TensorAutomata.jl. The approach covers several semiring families, including log-semirings and multi-valued semirings, with caveats for idempotent semirings.

Abstract

We propose a new method, which we call the "morphism-trick", to integrate custom implementations of vector-Jacobian products into automatic differentiation software, applicable to a wide range of semiring-based computations. Our approach leads to efficient and semiring-agnostic implementations of the backward pass of dynamic programming algorithms. For the particular case of finite-state methods, we introduce an algorithm that computes and differentiates the $\oplus$-sum of the weights of all paths of a finite-state automaton. Results show that, with minimal effort from the user, our novel library computes the gradient of a function w.r.t. the weights of a finite-state automaton orders of magnitude faster than state-of-the-art automatic differentiation systems. Implementations are made available via an open-source library distributed under a permissive license.
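To make the quantity in the abstract concrete, here is a minimal Python sketch of the $\oplus$-sum of all path weights for a small acyclic automaton under the log-semiring. It is not the paper's algorithm or the library's API: the function names (`nu_forward`, `nu_brute_force`), the arc-list representation, and the assumption that states are numbered in topological order are all illustrative choices.

```python
import math

NEG_INF = float("-inf")  # additive identity of the log-semiring


def log_add(a, b):
    """Log-semiring addition: log(e^a + e^b), computed stably."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def nu_forward(n_states, initial, final, arcs):
    """Sum of all path weights via a forward recursion.

    Assumes an acyclic automaton whose arcs (src, dst, weight) satisfy
    src < dst, so visiting arcs by increasing src is a topological order.
    """
    alpha = [NEG_INF] * n_states
    for q, w in initial.items():
        alpha[q] = log_add(alpha[q], w)
    for src, dst, w in sorted(arcs):
        # Semiring multiplication is ordinary addition in the log-semiring.
        alpha[dst] = log_add(alpha[dst], alpha[src] + w)
    total = NEG_INF
    for q, w in final.items():
        total = log_add(total, alpha[q] + w)
    return total


def nu_brute_force(initial, final, arcs):
    """Same quantity by explicit path enumeration (exponential, for checking)."""
    def path_weights(q, w):
        if q in final:
            yield w + final[q]
        for src, dst, aw in arcs:
            if src == q:
                yield from path_weights(dst, w + aw)

    total = NEG_INF
    for q, w in initial.items():
        for pw in path_weights(q, w):
            total = log_add(total, pw)
    return total


# Toy automaton: three states, two parallel arcs from 0 to 1.
initial = {0: 0.0}
final = {2: 0.0}
arcs = [(0, 1, -0.5), (0, 1, -2.0), (0, 2, -1.2), (1, 2, -0.3)]

nu_dp = nu_forward(3, initial, final, arcs)
nu_bf = nu_brute_force(initial, final, arcs)
print(nu_dp, nu_bf)
```

The forward recursion visits each arc once, while the enumeration grows with the number of paths; the two agree on this toy example.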

Paper Structure (17 sections, 2 theorems, 38 equations, 7 figures, 2 algorithms)

Introduction
Related Works
Preliminaries
Automatic Differentiation of the Semiring Dot Product
Differentiation via the "Morphism-Trick"
Applicable Semirings
Semirings Isomorphic to the Real Line
Multi-valued Semiring
Idempotent Semirings
Differentiating DP Algorithms with Finite State Methods
Computing $\nu(\mathcal{A})$
Differentiation with the morphism-trick
Conclusion
Acknowledgement
Reverse-Mode differentiation
...and 2 more sections

Key Result

proposition 1

(Morphism-trick) Let $(S, \oplus, \otimes, \bar{0}, \bar{1})$ be a semiring such that the monoid $(S, \oplus, \bar{0})$ is isomorphic to $(\mathbb R, +, 0)$, and let $\mu: S \to \mathbb R$ be the associated morphism and $\mu^{-1}$ its inverse. For all $\mathbf{x}, \mathbf{y} \in S^K$, if then:
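The hypotheses alone already suggest how such a derivative can be computed. Assuming the statement concerns the semiring dot product $z = \bigoplus_k x_k \otimes y_k$ (an assumption, not taken from the text above), the morphism property gives $\mu(z) = \sum_k \mu(x_k \otimes y_k)$, and the chain rule then yields $\partial z / \partial x_i = (\mu^{-1})'(\mu(z)) \cdot \partial \mu(x_i \otimes y_i) / \partial x_i$. The Python sketch below instantiates this for the log-semiring, taking $\mu = \exp$, $\mu^{-1} = \log$ and $\otimes = +$, and checks it against finite differences. The function names are hypothetical, and the naive `exp` is used for readability rather than numerical stability.

```python
import math


def semiring_dot(x, y, mu, mu_inv, otimes):
    """z = (+)_k x_k (x) y_k, evaluated as mu_inv(sum_k mu(x_k (x) y_k))."""
    return mu_inv(sum(mu(otimes(a, b)) for a, b in zip(x, y)))


def morphism_vjp_x(x, y, z_bar, mu, dmu, dmu_inv, otimes, dotimes_dx):
    """Vector-Jacobian product of the semiring dot product w.r.t. x.

    Only real-valued derivatives are needed: dmu and dmu_inv are the
    derivatives of mu and mu_inv, dotimes_dx is d(a (x) b)/da.
    """
    s = sum(mu(otimes(a, b)) for a, b in zip(x, y))  # equals mu(z)
    outer = z_bar * dmu_inv(s)
    return [outer * dmu(otimes(a, b)) * dotimes_dx(a, b) for a, b in zip(x, y)]


# Log-semiring instance (illustrative choice of morphism).
def add(a, b):
    return a + b


def one(a, b):
    return 1.0


def reciprocal(s):
    return 1.0 / s


x = [0.1, -0.7, 1.3]
y = [0.5, 0.2, -1.0]

z = semiring_dot(x, y, math.exp, math.log, add)
grad = morphism_vjp_x(x, y, 1.0, math.exp, math.exp, reciprocal, add, one)


def finite_difference(i, eps=1e-6):
    """Central finite difference of z with respect to x[i]."""
    xp, xm = list(x), list(x)
    xp[i] += eps
    xm[i] -= eps
    zp = semiring_dot(xp, y, math.exp, math.log, add)
    zm = semiring_dot(xm, y, math.exp, math.log, add)
    return (zp - zm) / (2 * eps)


fd = [finite_difference(i) for i in range(len(x))]
print(grad)
print(fd)
```

For the log-semiring the result is the familiar softmax weights $e^{x_i + y_i - z}$, so the gradient entries sum to one. No computation graph is recorded: the backward pass is a single loop over the inputs.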

Figures (7)

Figure 1: (left) Run-time of the semiring dot product under the log-semiring (forward) and of its AD with Zygote [Innes2018], Enzyme [Moses2020], and Zygote augmented with a generic vector-Jacobian product (our proposed method). Our approach scales well because it bypasses the need to allocate and maintain the computational graph in memory. (right) Number of heap allocations performed by the different AD methods. Note that computing the semiring dot product itself requires no dynamic memory allocation.
Figure 2: Computation graph of the backward step of $z$ via the morphism-trick.
Figure 3: (left) Run-time of computing $\nu(\mathcal{A})$ under the log-semiring (forward) and of the AD of the computation with Enzyme [Moses2020] and with Zygote augmented with a general vector-Jacobian product (proposed). (right) Number of heap allocations performed by the different AD methods. For both plots, $K$ is the number of states plus the number of transitions of $\mathcal{A}$. The automata used for this benchmark were created artificially by iterated concatenation of the automaton illustrated in Fig. \ref{fig:example_automaton} (Appendix \ref{app:gradient_examples}).
Figure 4: Examples of the gradient of $\nu(\mathcal{A})$ (see $\mathcal{A}$ in Fig. \ref{fig:ad_log}) under the tropical and arctic semirings. Dashed gray lines indicate zero-valued partial derivatives.
Figure 5: (a) Finite-state automaton $\mathcal{B}$ over the log-expectation semiring: $(\mathbb{R} \times \mathbb{R}, \oplus, \otimes, (-\infty, 0), (0, 0))$ where $(x, a) \oplus (y, b) = (\log(e^x + e^y), a + b)$ and $(x, a) \otimes (y, b) = (x + y, e^x b + e^y a)$. (b) Gradient of $z$ where $\nu(\mathcal{B}) = (z, c)$ with respect to the parameters of $\mathcal{B}$.
...and 2 more figures
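
The log-expectation semiring defined in the caption of Figure 5 can be written down directly. The following Python sketch transcribes the two operations and the two identity elements from that caption; the names `oplus`, `otimes`, `ZERO` and `ONE` are illustrative, and representing elements as tuples is an assumption made for brevity.

```python
import math

NEG_INF = float("-inf")

ZERO = (NEG_INF, 0.0)  # additive identity, (-inf, 0) in the caption
ONE = (0.0, 0.0)       # multiplicative identity, (0, 0) in the caption


def oplus(p, q):
    """(x, a) (+) (y, b) = (log(e^x + e^y), a + b)."""
    (x, a), (y, b) = p, q
    m = max(x, y)
    if m == NEG_INF:
        # Both first components are -inf: log(0 + 0) stays -inf.
        return (NEG_INF, a + b)
    return (m + math.log(math.exp(x - m) + math.exp(y - m)), a + b)


def otimes(p, q):
    """(x, a) (x) (y, b) = (x + y, e^x * b + e^y * a)."""
    (x, a), (y, b) = p, q
    return (x + y, math.exp(x) * b + math.exp(y) * a)


p = (0.3, 1.5)
q = (-0.4, 2.0)

print(oplus(p, q))
print(otimes(p, q))
```

With these definitions, `ZERO` is neutral for `oplus` and absorbing for `otimes`, and `ONE` is neutral for `otimes`, as the caption's choice of identities requires.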

Theorems & Definitions (4)

proposition 1
proof
proposition 2
proof
