Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini

TL;DR

This work investigates whether small LLMs (~1B parameters) can match larger baselines on Italian medical NLP tasks by systematically evaluating inference-time and training-time adaptations across Llama-3, Gemma-3, and Qwen3. It compares zero-shot, constraint decoding, few-shot prompting, instruction-tuning, and continual pre-training (CPT), finding that supervised fine-tuning (FT) is typically the most effective strategy, while few-shot prompting combined with constraint decoding is a strong low-resource option. The study introduces a large, public collection of Italian medical NLP datasets and a 126M-word clinical CPT corpus, and shows that a compact model such as Qwen3-1.7B with FT can outperform the much larger Qwen3-32B baseline by about +9.2 points on average, while CPT provides more limited gains. Overall, the results support the viability of small LLMs for practical medical NLP in resource-constrained healthcare settings, and they emphasize task- and data-specific adaptation as the route to strong generalization, including in out-of-distribution (OOD) contexts.

Abstract

Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pre-training). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers a strong lower-resource alternative. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.

Paper Structure (41 sections, 3 figures, 10 tables)

Figures (3)

  • Figure 1: Average performance of 1B LLMs on the 14 medical sub-tasks when different methods are applied at inference time (left) and at training time (right). Fine-Tuning (FT) is the most effective approach overall, consistently outperforming the baseline (Qwen3-32B with 4-shot). Continual Pre-Training (CPT) improves over plain FT in only one case (gemma-3-1b-it). 4-shot is consistently better than Constraint Decoding (CD), and combining the two proves beneficial.
  • Figure 2: Impact of the 4-shot and Constraint Decoding (CD) settings on inference time. 4-shot significantly increases inference time, whereas CD does not. The average is computed over 5 models and 14 sub-tasks, using the vLLM and outlines libraries for model serving.
  • Figure 3: Visualization of sequence packing and the corresponding 2D attention mask that prevents cross-sample attention.
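
The idea behind Figure 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pack_sequences` and `packed_attention_mask` are hypothetical, the token ids are made up, and the causal (lower-triangular) constraint is an assumption based on how decoder-only LLMs are usually trained. The sketch concatenates several short samples into one packed sequence and builds a 2D mask in which a position may attend only to earlier positions of the same sample.

```python
from typing import List, Tuple


def pack_sequences(samples: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate tokenized samples into a single packed sequence.

    Returns the packed token ids and a parallel list that records,
    for each position, the index of the sample it came from.
    """
    packed: List[int] = []
    sample_ids: List[int] = []
    for idx, tokens in enumerate(samples):
        packed.extend(tokens)
        sample_ids.extend([idx] * len(tokens))
    return packed, sample_ids


def packed_attention_mask(sample_ids: List[int]) -> List[List[int]]:
    """Build a 2D mask where mask[i][j] == 1 means position i may attend to j.

    Attention is allowed only within the same sample (block-diagonal),
    and only to the current or earlier positions (causal, an assumption).
    """
    n = len(sample_ids)
    return [
        [1 if sample_ids[i] == sample_ids[j] and j <= i else 0 for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two toy samples with made-up token ids.
    packed, sample_ids = pack_sequences([[11, 12, 13], [21, 22]])
    for row in packed_attention_mask(sample_ids):
        print(row)
```

In the resulting mask, the rows for the second sample contain zeros in the columns of the first sample, which is what prevents cross-sample attention when several samples share one packed training sequence.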