Supervised In-Context Fine-Tuning for Generative Sequence Labeling

David Dukić, Goran Glavaš, Jan Šnajder

TL;DR

The paper tackles sequence labeling (SL) with decoder-based LLMs by introducing supervised in-context fine-tuning (SIFT), a framework that combines supervised fine-tuning with in-context demonstrations under a generative, response-focused objective. Comparing three fine-tuning strategies, vanilla causal language modeling (CLM), single-response completion (SRC), and multi-response completion (MRC), across four SL tasks and five LLMs, the authors show that MRC and dense demonstrations yield substantial gains over standard in-context learning (ICL) and decoder-as-encoder baselines. They also find that the performance loss caused by long contexts can be mitigated by omitting the instruction, which suggests that instruction-free prompts are a practical choice in many SL scenarios. The findings underscore the potential of response-based, generative task formulations for SL with LLMs and point to concrete best practices for applying SIFT to real-world NLP tasks.

Abstract

Sequence labeling (SL) tasks, where labels are assigned to tokens, are abundant in NLP (e.g., named entity recognition and aspect-based sentiment analysis). Owing to the intuition that they require bidirectional context, SL tasks are commonly tackled with encoder-only models. Recent work also shows that removing the causal mask in fine-tuning enables decoder-based LLMs to become effective token classifiers. Less work, however, has focused on (supervised) generative SL, a more natural setting for causal LLMs. Due to their rapid scaling, causal LLMs applied to SL are expected to outperform encoders, whose own development has stagnated. In this work, we propose supervised in-context fine-tuning (SIFT) for generative SL. SIFT casts SL tasks as constrained response generation, natural to LLMs, combining in-context learning (ICL) from demonstrations with supervised fine-tuning. SIFT considerably outperforms both ICL and decoder-as-encoder fine-tuning baselines on a range of standard SL tasks. We further find that although long context hinders the performance of generative SL in both ICL and SIFT, this deficiency can be mitigated by removing the instruction, as instructions are shown to be largely unnecessary for achieving strong SL performance with SIFT. Our findings highlight strengths and limitations of SL with LLMs, underscoring the importance of a response-based generative task formulation for effective SL performance.

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Supervised in-context fine-tuning (SIFT) for sequence labeling tasks with three different strategies for generative fine-tuning with in-context demonstrations: (a) vanilla: causal language modeling (CLM) carried out on all prompt tokens; (b) single-response completion (SRC): CLM on the response tokens of the last instance; and (c) multi-response completion (MRC): CLM on the response tokens of all demonstration instances and of the last (target) instance. In-context learning (ICL) with constrained decoding at inference.
  • Figure 2: Supervised in-context fine-tuning (SIFT; in orange box) as a task-specific learning paradigm for LLMs, in relation to (standard) supervised fine-tuning (SFT; in purple box) and in-context learning (ICL; in blue box). For completeness, zero-shot inference (no labeled instances) is shown in the green box.
  • Figure 3: Micro F1 scores using five base and instruct variants of decoders on ICL and for a varying number of shots. The x-axis shows the number of shots on an ordinal scale. The results are given for the validation set on four tasks (left to right), with top row plots corresponding to instruct variants and bottom row plots corresponding to base variants. All results are averages of four runs.
  • Figure 4: Micro F1 scores for five base variants of decoders on standard SFT and SIFT for a varying number of shots. The models are evaluated with the same number of shots in the context that they used for fine-tuning. The results are given for the validation set on four tasks (left to right) and for three CLM strategies (top to bottom). All results are averages of four runs. See the complementary results section for instruct variants.
  • Figure 5: Instruction variants applied on the validation set at inference time for Mistral-7B (base) models trained with MRC and without instructions. The reference (Ref.) lines show the results for the same models trained with instructions. All results are averages of four runs.
  • ...and 2 more figures
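
The three fine-tuning strategies named in the Figure 1 caption differ only in which prompt tokens contribute to the CLM loss. The sketch below illustrates that difference as a 0/1 loss mask. It is not the authors' implementation: the segment representation, the function name build_loss_mask, the example tokens, and the "word = label" response format are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a prompt is a list of (kind, tokens) segments,
# where kind is "instruction", "query", or "response". Demonstration instances
# come first; the last query/response pair is the target instance.
Segment = Tuple[str, List[str]]


def build_loss_mask(segments: List[Segment], strategy: str) -> List[int]:
    """Return a 0/1 mask over the flattened tokens (1 = token is in the loss)."""
    if strategy not in {"vanilla", "src", "mrc"}:
        raise ValueError(f"unknown strategy: {strategy}")

    # Index of the last response segment, i.e. the target instance's response.
    response_ids = [i for i, (kind, _) in enumerate(segments) if kind == "response"]
    last_response = response_ids[-1] if response_ids else -1

    mask: List[int] = []
    for i, (kind, tokens) in enumerate(segments):
        if strategy == "vanilla":
            keep = True                   # CLM on all prompt tokens
        elif strategy == "mrc":
            keep = kind == "response"     # CLM on every response
        else:                             # "src"
            keep = i == last_response     # CLM on the last response only
        mask.extend([int(keep)] * len(tokens))
    return mask


# Toy one-shot prompt; the tokens and response format are made up.
EXAMPLE_SEGMENTS: List[Segment] = [
    ("instruction", ["Tag", "entities", ":"]),
    ("query", ["Paris", "is", "nice"]),
    ("response", ["Paris", "=", "LOC"]),
    ("query", ["Ana", "sings"]),
    ("response", ["Ana", "=", "PER"]),
]

if __name__ == "__main__":
    for name in ("vanilla", "src", "mrc"):
        print(name, build_loss_mask(EXAMPLE_SEGMENTS, name))
```

In a PyTorch training loop, one common way to apply such a mask is to set the target label of every masked-out position to -100, the default ignore_index of torch.nn.CrossEntropyLoss. Whether the paper does exactly this is not stated in the text above.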