Beyond Deep Learning: Speech Segmentation and Phone Classification with Neural Assemblies

Trevor Adelson, Vidhyasaharan Sethu, Ting Dang

Abstract

Deep learning dominates speech processing, but it relies on massive datasets and global backpropagation-guided weight updates, and it produces entangled representations. Assembly Calculus (AC), which models sparse neuronal assemblies via Hebbian plasticity and winner-take-all competition, offers a biologically grounded alternative, yet prior work has focused on discrete symbolic inputs. We introduce an AC-based speech processing framework that operates directly on continuous speech by combining three key contributions: (i) neural encoding that converts speech into assembly-compatible spike patterns using probabilistic mel binarisation and population-coded MFCCs; (ii) a multi-area architecture organising assemblies across hierarchical timescales and classes; and (iii) cross-area update schemes for downstream tasks. Applied to two core tasks, boundary detection and segment classification, our framework detects phone (F1=0.69) and word (F1=0.61) boundaries without any weight training, and achieves 47.5% and 45.1% accuracy on phone and command recognition, respectively. These results show that AC-based dynamical systems are a viable alternative to deep learning for speech processing.
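
As an illustration of contribution (i), the two encoders can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the function names, the tuning-curve centres, `sigma`, and `threshold` are all assumptions chosen for the example. The only details taken from the paper are that each normalised mel value is treated as a Bernoulli firing probability, and that each MFCC value is encoded by Gaussian tuning curves and then thresholded into a sparse binary vector.

```python
import math
import random


def bernoulli_binarise(mel, rng):
    """Sample one binary spike per mel bin, using the bin value as P(fire).

    `mel` is a list of frames, each a list of values already normalised to [0, 1].
    """
    return [[1 if rng.random() < p else 0 for p in frame] for frame in mel]


def population_code(value, centres, sigma, threshold):
    """Encode one scalar with Gaussian tuning curves, then threshold to binary.

    `centres`, `sigma` and `threshold` are hypothetical parameters for this sketch.
    """
    activation = [math.exp(-0.5 * ((value - c) / sigma) ** 2) for c in centres]
    spikes = [1 if a >= threshold else 0 for a in activation]
    return activation, spikes


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two toy "frames" of three mel bins each (made-up values, not real speech).
mel = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25]]
spikes = bernoulli_binarise(mel, rng)

# Nine neurons with preferred values evenly spaced from -2.0 to 2.0.
centres = [-2.0 + 0.5 * i for i in range(9)]
activation, code = population_code(0.0, centres, sigma=0.5, threshold=0.5)
print(spikes)
print(code)
```

Bins with value 0 never fire and bins with value 1 always fire, so brighter spectral regions yield denser spike patterns. In the population code, a single value activates a small group of neighbouring neurons whose tuning curves overlap it.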

Paper Structure

This paper contains 23 sections, 8 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Fundamental Project operation of Assembly Calculus. Binary input vectors are projected into a neural area where the $k$-cap operation selects the top-$k$ most activated neurons to form a sparse assembly. Recurrent plasticity strengthens connections between co-active neurons, causing assemblies to stabilise over repeated presentations of the same input.
  • Figure 2: Inside a per-class RecurrentArea $c$ over a period of $n$ frames. $C$ independent areas (one per class) process the same input in parallel. Each neuron receives feedforward drive from the current input $\mathbf{x}^{(t)}$ plus recurrent drive from the previous assembly $\mathbf{a}_{c}^{(t{-}1)}$; the top-$k$ neurons fire ($k$-cap). Plasticity strengthens co-active feedforward and recurrent edges, causing each area to learn class-specific spectro-temporal trajectories. The area with learned dynamics that best match the input (highest resonance score $R_c$) determines the predicted class.
  • Figure 3: Conceptual overview. Speech is processed by two separate AC pipelines. Top: binarised mel frames drive a frozen-weight refractory hierarchy; the change signal $c^{(t)}$ marks boundaries ($\beta{=}0$). Bottom: population-coded MFCCs drive per-class RecurrentAreas; resonance scoring $R_c$ identifies the class ($\beta{>}0$). Columns mark the three contributions: (i) neural encoding, (ii) area architecture, (iii) task read-out.
  • Figure 4: Probabilistic mel binarisation applied to a single phone segment (/eh/, 133 ms). Left: continuous mel spectrogram normalised to $[0,1]$. Right: binary spike pattern obtained by treating each mel bin value as a Bernoulli firing probability. Brighter spectral regions produce denser activations.
  • Figure 5: Population-coded MFCC binarisation pipeline. Top row: mel spectrogram of a single phone (/eh/) is transformed into 13 MFCC coefficients, encoded by Gaussian tuning curves into a continuous population code, and thresholded to produce a sparse binary vector. Bottom row: detail of the Gaussian tuning curves for one coefficient (C1), showing how a single MFCC value activates overlapping neurons, and the resulting continuous and binary activations.
  • ...and 4 more figures
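
The Project step described in the captions for Figures 1 and 2 (summed feedforward and recurrent drive, $k$-cap, then plasticity on co-active edges) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: the sparse random 0/1 connectivity, the multiplicative $(1+\beta)$ weight update, the index-based tie-breaking, and every size and rate below are assumptions borrowed from common Assembly Calculus formulations. The names `k_cap` and `project_step` are hypothetical.

```python
import random


def k_cap(drive, k):
    """Return the indices of the k most strongly driven neurons.

    Ties are broken in favour of the lower index (an assumption of this sketch).
    """
    order = sorted(range(len(drive)), key=lambda i: (-drive[i], i))
    return sorted(order[:k])


def project_step(x, prev, w_ff, w_rec, k, beta):
    """One Project step: sum the drive, apply k-cap, update weights in place.

    x     : binary input vector for the current frame
    prev  : indices of the assembly that fired on the previous frame
    beta  : plasticity rate; beta = 0 leaves all weights frozen
    """
    n = len(w_rec)
    active_in = [j for j, v in enumerate(x) if v]
    drive = [
        sum(w_ff[i][j] for j in active_in) + sum(w_rec[i][j] for j in prev)
        for i in range(n)
    ]
    winners = k_cap(drive, k)
    for i in winners:
        for j in active_in:
            w_ff[i][j] *= 1.0 + beta   # strengthen active input -> winner edges
        for j in prev:
            w_rec[i][j] *= 1.0 + beta  # strengthen previous assembly -> winner edges
    return winners


rng = random.Random(0)
n_in, n, k, p = 40, 100, 10, 0.2  # illustrative sizes, not the paper's settings
w_ff = [[1.0 if rng.random() < p else 0.0 for _ in range(n_in)] for _ in range(n)]
w_rec = [[1.0 if rng.random() < p else 0.0 for _ in range(n)] for _ in range(n)]
x = [1 if j < 12 else 0 for j in range(n_in)]  # one fixed binary input pattern

assembly, history = [], []
for _ in range(30):  # repeated presentations of the same input
    assembly = project_step(x, assembly, w_ff, w_rec, k, beta=0.1)
    history.append(assembly)
print(history[-1])
```

Comparing successive entries of `history` shows whether the winner set has stopped changing for this seed. Calling `project_step` with `beta=0` gives the frozen-weight variant that the Figure 3 caption associates with the boundary-detection pipeline.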