Mass-Spring Models for Passive Keyword Spotting: A Springtronics Approach

Finn Bohte, Theophile Louvet, Vincent Maillou, Marc Serra Garcia

TL;DR

This work demonstrates a passive mass-spring computational framework, springtronics, capable of performing keyword spotting with competitive accuracy on a real 12-class speech benchmark using hundreds of degrees of freedom. It combines analogue feature extraction (mechanical Mel filters and cubic-root compression) with a continuous-time convolution realized via delay lines and a matrix-vector multiplication implemented by zero-modes, followed by a quadratic activation and leaky-integrator readout. Training leverages the reformulation of the convolution as a linear SVM, enabling efficient weight determination before mapping to the mass-spring hardware. The results show competitive accuracy relative to sub-milliwatt electronics and highlight energy-accuracy trade-offs, suggesting a viable path toward low-power mechanical computing on MEMS platforms and broader applications of the springtronics framework.

Abstract

Mechanical systems played a foundational role in computing history, and have regained interest due to their unique properties, such as low damping and the ability to process mechanical signals without transduction. However, recent efforts have primarily focused on elementary computations, implemented in systems based on pre-defined reservoirs, or in periodic systems such as arrays of buckling beams. Here, we numerically demonstrate a passive mechanical system -- in the form of a nonlinear mass-spring model -- that tackles a real-world benchmark for keyword spotting in speech signals. The model is organized in a hierarchical architecture combining feature extraction and continuous-time convolution, with each individual stage tailored to the physics of the considered mass-spring systems. For each step in the computation, a subsystem is designed by combining a small set of low-order polynomial potentials. These potentials act as fundamental components that interconnect a network of masses. In analogy to electronic circuit design, where complex functional circuits are constructed by combining basic components into hierarchical designs, we refer to this framework as springtronics. We introduce springtronic systems with hundreds of degrees of freedom, achieving speech classification accuracy comparable to existing sub-mW electronic systems.
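As a rough digital analogue of the signal chain described above (squaring, cubic-root compression, low-pass filtering, then a convolution over time-delayed feature copies with a squaring activation and a leaky-integrator readout), the following minimal sketch may help fix ideas. It is not the authors' mass-spring model: the function names, the first-order low-pass filter, the discrete-time leaky integrator, and all parameter values (`alpha`, `leak`, `delays`) are assumptions chosen for illustration, and the input is assumed to be already split into frequency bands.

```python
import numpy as np

def lowpass(x, alpha):
    """First-order low-pass filter along the last axis (assumed form, 0 < alpha <= 1)."""
    y = np.zeros_like(x, dtype=float)
    acc = np.zeros(x.shape[:-1])
    for t in range(x.shape[-1]):
        acc = acc + alpha * (x[..., t] - acc)
        y[..., t] = acc
    return y

def extract_features(bands, alpha=0.05):
    """bands: (n_bands, T) band-filtered signals -> (n_bands, T) features."""
    energy = bands ** 2            # signal squaring
    compressed = np.cbrt(energy)   # cubic-root compression
    return lowpass(compressed, alpha)

def classify(features, W, w_out, delays, leak=0.01):
    """Convolution as a matrix-vector product on time-delayed feature copies.

    W:      (n_hidden, n_bands * len(delays)) kernel weights
    w_out:  (n_hidden,) readout weights
    delays: integer sample delays (hypothetical values)
    """
    n_bands, T = features.shape
    readout = 0.0
    for t in range(max(delays), T):
        # Stack the delayed copies of every feature into one input vector.
        x = np.concatenate([features[:, t - d] for d in delays])
        h = (W @ x) ** 2                                   # squaring activation
        readout = (1.0 - leak) * readout + leak * (w_out @ h)  # leaky integration
    return readout
```

In this sketch one such readout would be computed per keyword class, and the predicted class would be the one with the largest readout value.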

Paper Structure

This paper contains 18 sections, 14 equations, 8 figures.

Figures (8)

  • Figure 1: Visual depiction of springtronics elements: Masses are represented by circles and are connected through potentials. Dampings are typically not drawn, but can be depicted as a dashpot. Potentials ${V}(\mathbf{x})$ are positive definite and act on one or multiple masses, and are illustrated in the following ways: (a) Local harmonic potentials are depicted as springs connected to a support. For Duffing potentials, the spring is crossed by an arrow. For visual clarity, springs for local potentials can be omitted and the mass can be crossed by an arrow to indicate Duffing potentials. Linear coupling potentials are depicted as springs connecting two masses, and quadratic couplings are depicted as connecting lines overlaid with a triangle pointing towards $x_j$. Here, linearity refers to the force-displacement relation derived from the potentials. (b) The Cross Kerr coupling is depicted as a connection with two triangles pointing towards each other, and an asymmetric Cross Kerr coupling with two triangles pointing towards $x_j$. (c) Illustration of the leverage parameter $\alpha$. The coupling potentials are parameterized by a strength ($k, \gamma, \kappa$) and a leverage ($\alpha$). The leverage can be understood as the arm of a lever inserted between two segments of a spring that connects two masses.
  • Figure 2: Model architecture: (a) The model consists of a feature extraction stage and a classification stage. The $n$ time-dependent features capture spectral and temporal patterns of the speech signal. The extraction method is based on log-Mel spectrograms. Next, a convolutional classifier, composed of a CNN and a readout layer, predicts the likelihood of a keyword being present in the sound signal. (b) The features are extracted through signal filtering, signal squaring and signal compression operations. First, a Mel filterbank is applied, splitting the signal into multiple frequency bands. The resulting signals are squared, followed by a cubic-root compression. Finally, a lowpass filter is applied to the signal. (c) The convolutional kernel weights $\mathbf{W}$ are realized via an instantaneous matrix-vector multiplication of time-delayed copies of the features. We use a squaring activation function $x\mapsto x^2$ in the convolutional layer. The readout layer linearly combines the convolutional layer outputs, and performs a leaky integration over time. This integration yields the model readout. For each class, a different model is trained. The final prediction is given by the model producing the largest readout value.
  • Figure 3: Mass-spring model overview: (a) Schematic representation of the speech classification mass-spring model. Subsystems are indicated with corresponding functionalities. (b) Amplitude response of the mass-spring approximation of the Mel filterbank. The mechanical filter responses are formed by combining two Lorentzians (as shown in Fig. 4(b)). (c) Example trajectory of a speech feature extracted by the mass-spring model (purple), together with the band-filtered speech signal obtained through the approximate Mel filterbank in (b) (orange). (d) Dispersion relation $\omega(k)$ for the delay line as per Eq. \ref{eq:dispersion_rel} (blue), with corresponding group velocity $v_g(k) = \frac{\partial \omega}{\partial k}(k)$ (red) for parameters $m=10^{-3}$, $k_c=10.0$, and $k_l = 10^{-5}$. (e) Discretized representation of the geometry for the MVM unit cell from \cite{louvet2024reprogrammable}. Note that the rectangles here represent masses that can move in-plane, each corresponding to two springtronic degrees of freedom, and the straight lines represent geometric constraints. We construct an equivalent mass-spring model by assigning a large but finite stiffness to the geometric constraints, and then reducing the system to a set of input and output springtronic degrees of freedom.
  • Figure 4: Mechanical Mel filters: (a) Amplitude responses of the optimized two-mode filterbank with 8 filters (full line) and the target Mel filter responses (dotted line). (b) Amplitude response of the 4th filter above (gray), with the responses of the two individual modes it is composed of: ${\alpha \omega_a^2}/{(\omega_a^2 - \omega^2 + i \alpha\omega)}$ (blue) and ${\alpha \omega_b^2}/{(\omega_b^2 - \omega^2 + i \alpha \omega)}$ (red), the difference of which defines the filter as per Eq. \ref{eq:filter-response}. (c) Amplitude response of the 7th filter above (gray) and an optimized single-mode (Lorentzian) filter (purple), with the corresponding Mel filter (dotted line). The Lorentzian response is non-zero left of the Mel filter band, and the width of its peak deviates more from the width of the triangular response than that of the two-mode filter. These properties are related: the quality factor of peak-normalized Lorentzians determines both the response at $\omega=0$ and the width of the peak. The response of the filter has an impact on the classification accuracy. For the classification problem described in Sec. \ref{sec:performance}, a two-mode filter achieves an accuracy of 81.4%, compared to 75.2% for a single-mode filter. As a reference, a digital Mel filterbank achieves an accuracy of 82.0%.
  • Figure 5: Energetics of the quadratic coupling: (a) The system consists of a truncated chain of linearly coupled masses, modeling an infinite delay line, with impedance-matched damping on both ends to simulate an infinite chain length. The system is excited from the left terminal site $x_1$. We study how the energy dissipated at the right terminal site $x_n$ changes when replacing the last linear coupling with a quadratic coupling. (b) Log-log plot of energy dissipated at $x_n$ ($E_{\text{out}}$) against energy input ($E_{\text{in}}$) for varying input pulse amplitudes, comparing the quadratic coupling (purple) to linear coupling (black), with reference lines for squaring (red) and power $2/3$ exponentiation (blue). The gray area is bounded by $E_{\text{out}} = \frac{1}{2} E_{\text{in}}$, corresponding to complete energy transfer where half of the input energy is dissipated at each terminal site, and indicates the area inaccessible due to energy conservation.
  • ...and 3 more figures
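
The two-mode filter response quoted in the Figure 4 caption can be evaluated numerically. The sketch below implements the two single-mode terms exactly as written there and takes their difference; the function names and the values of `omega_a`, `omega_b`, and `alpha` are hypothetical and are not the optimized parameters from the paper. It illustrates one point made in the caption: a single Lorentzian mode has a non-zero response at $\omega=0$, whereas the difference of two modes cancels there.

```python
import numpy as np

def mode_response(omega, omega0, alpha):
    """Single-mode response alpha*omega0^2 / (omega0^2 - omega^2 + i*alpha*omega)."""
    return alpha * omega0**2 / (omega0**2 - omega**2 + 1j * alpha * omega)

def two_mode_filter(omega, omega_a, omega_b, alpha):
    """Difference of two single-mode responses, as described in the caption."""
    return mode_response(omega, omega_a, alpha) - mode_response(omega, omega_b, alpha)

# Illustrative (assumed) parameters, in arbitrary frequency units.
omega_a, omega_b, alpha = 1.0, 1.2, 0.1
omega = np.linspace(0.0, 2.0, 401)

single = np.abs(mode_response(omega, omega_a, alpha))
double = np.abs(two_mode_filter(omega, omega_a, omega_b, alpha))

# At omega = 0 each mode contributes alpha, so the two-mode difference vanishes.
print("single-mode |H(0)|:", single[0])
print("two-mode    |H(0)|:", double[0])
```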