Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling
David Dukić, Jan Šnajder
TL;DR
This work investigates how decoder-only LLMs can be improved for sequence labeling (SL) by removing the causal mask (CM) from selected groups of decoder blocks during fine-tuning. The $n=32$ decoder blocks are split into $m=4$ groups of $b=8$ blocks each, the CM is removed from a chosen subset of groups, and the model is fine-tuned with QLoRA. The authors show that selective CM removal yields competitive or superior SL performance on NER, ABSA, ACE05 trigger classification, and chunking, compared with strong encoder baselines and instruction-tuned (IT) baselines. The key finding is that the optimal unmasking configuration is task-dependent and that full unmasking (1111, i.e., all four groups unmasked) is rarely the best choice; unmasking groups closer to the model's output generally yields larger gains. A scaling experiment shows that the benefits of CM removal emerge at larger parameter counts (7B) and do not arise at small scales (68M), while MLM pre-training can hinder performance on some tasks. Overall, the work highlights the importance of model architecture and scale for SL with open LLMs and points to a future direction: predicting the effect of CM removal efficiently, without exhaustively fine-tuning every configuration.
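The group-based configuration strings can be made concrete with a minimal sketch that expands a string such as 0011 into one flag per decoder block. The function name `unmask_flags` is hypothetical, and the bit order (leftmost bit refers to the group nearest the input) is an assumption for illustration, not something this summary specifies.

```python
from itertools import product
from typing import List


def unmask_flags(config: str, blocks_per_group: int = 8) -> List[bool]:
    """Expand a group-level configuration into one flag per decoder block.

    True means the causal mask is removed in that block.
    ASSUMPTION: the leftmost bit refers to the group nearest the input.
    """
    if not config or set(config) - {"0", "1"}:
        raise ValueError("config must be a non-empty string of 0s and 1s")
    flags: List[bool] = []
    for bit in config:
        # Every block in a group shares the group's setting.
        flags.extend([bit == "1"] * blocks_per_group)
    return flags


# With m = 4 groups there are 2^4 = 16 possible group-level configurations.
all_configs = ["".join(bits) for bits in product("01", repeat=4)]

# Under the assumed bit order, "0011" unmasks the 16 blocks nearest the output.
flags = unmask_flags("0011")
print(len(all_configs), len(flags), sum(flags))
```

Keeping the configuration at group level keeps the search space at 16 settings, whereas choosing per block would give $2^{32}$.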
Abstract
Pre-trained language models based on masked language modeling (MLM) excel in natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, recent decoder-only large language models (LLMs) perform on par with smaller MLM-based encoders. Although their performance improves with scale, LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL). We hypothesize that LLMs' poor SL performance stems from causal masking, which prevents the model from attending to tokens on the right of the current token. Yet, how exactly and to what extent LLMs' performance on SL can be improved remains unclear. We explore techniques for improving the SL performance of open LLMs on IE tasks by applying layer-wise removal of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with state-of-the-art SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong MLM-based encoders and even instruction-tuned LLMs.
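To illustrate the hypothesis about causal masking, the following minimal sketch builds a boolean attention mask with and without the causal constraint. The helper `attention_mask` and the example sentence are hypothetical and are not taken from the paper; real implementations apply the mask inside each attention layer.

```python
from typing import List


def attention_mask(seq_len: int, causal: bool) -> List[List[bool]]:
    """Return mask[i][j], True when position i may attend to position j."""
    return [
        [(j <= i) or not causal for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["New", "York", "is", "big"]
causal_mask = attention_mask(len(tokens), causal=True)
full_mask = attention_mask(len(tokens), causal=False)

# Tokens visible from the first position ("New") under each mask.
visible_causal = [t for t, ok in zip(tokens, causal_mask[0]) if ok]
visible_full = [t for t, ok in zip(tokens, full_mask[0]) if ok]
print(visible_causal, visible_full)
```

Under the causal mask, the representation of "New" cannot draw on "York", which is the kind of right-hand context a token-level labeler may need.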
