Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling

David Dukić, Jan Šnajder

TL;DR

This work investigates how decoder-only LLMs can be improved for sequence labeling (SL) by removing the causal mask (CM) in a layer-wise, group-based fashion during fine-tuning. The $n=32$ decoder blocks are split into $m=4$ groups of $b=8$ blocks each. By unmasking selected groups and training with QLoRA, the authors demonstrate that selective CM removal yields competitive or superior SL performance on NER, ABSA, ACE05 trigger classification, and chunking, compared to strong encoder baselines and instruction-tuned (IT) baselines. The key finding is that the optimal unmasking configuration is task-dependent and that full unmasking (1111) is rarely the best choice; unmasking groups closer to the model's output generally yields larger gains. A scale experiment shows that the benefits of CM removal emerge at larger parameter counts (7B) and do not arise at small scales ($\text{68M}$), while MLM pre-training can hinder performance on some tasks. Overall, the work highlights the importance of model architecture and scale for SL with open LLMs and points to future avenues for efficiently predicting CM-removal effects without exhaustive tuning.
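
To make the notion of an unmasking configuration concrete, the following is a minimal Python sketch, not the authors' implementation. It expands a four-bit configuration such as 0001 into one flag per decoder block and builds the matching attention pattern for a short sequence. The function names `blocks_unmasked` and `attention_mask` are hypothetical, and the bit order (leftmost bit = group closest to the input, rightmost bit = group closest to the output) is an assumption made for illustration only.

```python
def blocks_unmasked(config: str, blocks_per_group: int = 8) -> list[bool]:
    """Expand a group-level configuration into one flag per decoder block.

    A "1" means the causal mask is removed for every block in that group.
    ASSUMPTION: the leftmost bit is the group closest to the input.
    """
    if not config or set(config) - {"0", "1"}:
        raise ValueError("config must be a non-empty string of 0s and 1s")
    flags: list[bool] = []
    for bit in config:
        flags.extend([bit == "1"] * blocks_per_group)
    return flags


def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """Return a seq_len x seq_len mask: 1 = may attend, 0 = blocked.

    With causal=True, position i attends only to positions j <= i.
    With causal=False, every position attends to every position.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Configuration 0001: under the assumed bit order, only the last group of
# eight blocks (blocks 24..31) has its causal mask removed.
flags = blocks_unmasked("0001")
masks = [attention_mask(4, causal=not unmasked) for unmasked in flags]
```

Under these assumptions, `masks[0]` is lower-triangular (causal) while `masks[31]` is all ones (bidirectional). In an actual fine-tuning setup, a per-block mask of this kind would have to be passed to each decoder block's attention computation.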

Abstract

Pre-trained language models based on masked language modeling (MLM) excel in natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, recent decoder-only large language models (LLMs) perform on par with smaller MLM-based encoders. Although their performance improves with scale, LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL). We hypothesize that LLMs' poor SL performance stems from causal masking, which prevents the model from attending to tokens on the right of the current token. Yet, how exactly and to what extent LLMs' performance on SL can be improved remains unclear. We explore techniques for improving the SL performance of open LLMs on IE tasks by applying layer-wise removal of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with state-of-the-art SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong MLM-based encoders and even instruction-tuned LLMs.

Paper Structure

This paper contains 29 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Layer-wise causal mask removal from decoder block groups in a decoder-only LLM. Here, the causal mask is removed from the top eight decoder blocks of the Llama2-7B model to enable bidirectionality during fine-tuning, which proves beneficial for many SL tasks.
  • Figure 2: Micro F1 SL scores with decoder-based LLMs and different unmasking configurations (sorted by Gray code starting with all decoder layers masked -- configuration 0000). Upper row plots show Llama2-7B model results, while lower row plots show Mistral-7B model results on validation and test sets of four SL tasks (left to right, one dataset per task). All results are averages of five runs. The shaded area corresponds to standard deviation.
  • Figure 3: Validation set micro F1 SL scores of small pre-trained MLM-based encoder and CLM-based decoder models after fine-tuning on SL task training data, starting from a particular LM checkpoint. Decoder Unmask: model pre-trained with CLM and a CM but fine-tuned without the CM. All results are averages over five runs. The shaded area represents the standard deviation.
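
The Figure 2 caption says the unmasking configurations are sorted by Gray code, starting from 0000. As a minimal sketch, assuming the standard reflected binary Gray code (the caption does not specify which variant is used), the 16 four-bit configurations can be enumerated so that neighbouring configurations differ in exactly one group. The function name `gray_code_configs` is hypothetical.

```python
def gray_code_configs(num_groups: int = 4) -> list[str]:
    """List all configurations in reflected binary Gray code order.

    ASSUMPTION: the standard reflected Gray code, where the i-th code word
    is i XOR (i >> 1). Consecutive code words differ in exactly one bit.
    """
    return [
        format(i ^ (i >> 1), f"0{num_groups}b")
        for i in range(2 ** num_groups)
    ]


configs = gray_code_configs()
for config in configs:
    print(config)
```

Ordering the x-axis this way means each step between adjacent points toggles the causal mask of a single block group, which makes the effect of that one change easier to read off a plot.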