Bringing Emerging Architectures to Sequence Labeling in NLP

Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares

TL;DR

This study broadens the evaluation of sequence labeling in NLP beyond Transformer encoders by systematically testing diffusion tagging, adversarial tagging, xLSTM, and SSD-based models across multilingual PoS, NER, and structured parsing tasks. The results show that adversarial tagging often matches or surpasses Transformer baselines, especially in complex structured settings, while diffusion tagging and structured-state-space models underperform in many cases. Non-Transformer encoders like the Bidirectional xLSTM and BiLSTM variants can excel on simpler tagging tasks but struggle to consistently beat Transformers on harder, long-range dependency problems. The findings suggest adversarial labeling as a promising direction for robust tagging across diverse linguistic structures, with practical implications for multilingual NLP where resource-conscious non-Transformer architectures can still deliver competitive performance. Limitations include computational resource demands and the use of MLM encoders over generative encoders, shaping the experimental design and scope of generalization.

Abstract

Pretrained Transformer encoders are the dominant approach to sequence labeling. While some alternative architectures, such as xLSTMs, structured state-space models, diffusion models, and adversarial learning, have shown promise in language modeling, few have been applied to sequence labeling, and mostly on flat or simplified tasks. We study how these architectures adapt across tagging tasks that vary in structural complexity, label space, and token dependencies, with evaluation spanning multiple languages. We find that the strong performance previously observed in simpler settings does not always generalize well across languages or datasets, nor does it extend to more complex structured tasks.

Paper Structure

This paper contains 49 sections, 6 equations, 7 figures, 33 tables, 2 algorithms.

Figures (7)

  • Figure 1: Pareto front of LAS vs. speed (sent/s) on PTB dependency parsing. Colors are reserved for encodings, symbols and text annotations for architectures: LSTM (L), BiLSTM (B), xLSTM (L*), BixLSTM (B*), Mamba-2 (M2), XLM (X), DiT (XD) and GaT (XG).
  • Figure 2: PoS and NER accuracy ($y$) across output spaces ($x$, uneven intervals).
  • Figure 3: Diffusion tagger in forward and denoising steps. The symbol $\ominus$ (rotated by 90°) is the concatenation operator, and an open arrow denotes loss propagation. In the training subfigure, $\mathcal{E}_\theta$ embeds the sentence as the conditional signal. The real labels are transformed into bits and fed to the diffusion process, where the latent $\mathbf{x}_t$ is computed from the sampled noise $\mathbf{e}_t$, and concatenated with time embeddings $\tau(t)$ and the conditional signal. Then, $\mathcal{D}_\phi$ learns to extract the noise that was added to $\mathbf{x}_t$. All parameters are optimized with the MSE loss between the real and predicted noise. The inference subfigure shows the denoising process. The conditional signal is computed once with $\mathcal{E}_\theta$ and an initial signal $\mathbf{x}_T$ is sampled from Gaussian noise. Iteratively, $\mathcal{D}_\phi$ removes noise from the input and conditional signal and estimates the previous latent $\hat{\mathbf{x}}_{t-s}$ until $\hat{\mathbf{x}}_0$ is reached. Then, $\hat{\mathbf{x}}_0$ is fed to the BT module to recover a sequence of predicted labels.
  • Figure 4: Adversarial tagger view (symbols as in Figure 3). $G_\psi$ (green) is trained with the tag loss. $D_\varphi$ (blue) learns to distinguish valid tag sequences and guides $G_\psi$.
  • Figure 5: Example of a constituent tree encoded with the relative encoding and with tetratagging (one subfigure each).
  • ...and 2 more figures
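
The Figure 3 caption describes labels being turned into bits, noised, and later mapped back to labels. The following is a minimal sketch of that data path only, not the authors' implementation. The function names, the {-1, +1} bit values, and the cosine noise schedule are all assumptions made for illustration; the encoder $\mathcal{E}_\theta$, the denoiser $\mathcal{D}_\phi$, and the time embeddings $\tau(t)$ are omitted.

```python
import math
import random

def label_to_bits(label_id, n_bits):
    """Encode an integer label as n_bits values in {-1.0, +1.0}, most significant bit first."""
    return [1.0 if (label_id >> i) & 1 else -1.0 for i in reversed(range(n_bits))]

def bits_to_label(bits):
    """Threshold each (possibly noisy) value at 0 and read the result as a binary integer."""
    label_id = 0
    for b in bits:
        label_id = (label_id << 1) | (1 if b > 0 else 0)
    return label_id

def alpha_bar(t, T):
    """Assumed cosine schedule: share of signal kept at step t (1 at t=0, near 0 at t=T)."""
    return math.cos(0.5 * math.pi * t / T) ** 2

def forward_noise(x0, t, T, rng):
    """Sample noise e_t and return (x_t, e_t), with x_t = sqrt(a) * x_0 + sqrt(1 - a) * e_t."""
    a = alpha_bar(t, T)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def mse(pred, target):
    """Mean squared error between predicted and real noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

# Hypothetical 3-token sentence with tag ids drawn from a label space of size 8 (3 bits).
rng = random.Random(0)
labels = [3, 0, 5]
n_bits = 3
x0 = [b for label in labels for b in label_to_bits(label, n_bits)]

# Training-side step: noise the bit vector; a denoiser would be trained to predict eps.
xt, eps = forward_noise(x0, t=10, T=100, rng=rng)
perfect_loss = mse(eps, eps)  # a denoiser that recovers the noise exactly has zero loss

# Inference-side step: once a clean estimate is reached, decode it back to labels.
decoded = [bits_to_label(x0[i:i + n_bits]) for i in range(0, len(x0), n_bits)]
```

In this sketch, `bits_to_label` plays the role the caption assigns to the BT module: it only needs the sign of each value, so a slightly imperfect estimate of $\hat{\mathbf{x}}_0$ still decodes to the intended label.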