Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models

Ioana Buhnila; Aman Sinha; Mathieu Constant

TL;DR

The paper addresses two risks of large language models, medical hallucinations and high resource costs, by proposing pRAGe, a Retrieval Augmented Generation pipeline that uses open-source Small Language Models and a French medical knowledge base (RefoMed-KB) to generate concise sub-sentential paraphrases of medical terms. It combines an encoder-retriever-decoder architecture with prompted decoding and finetuning (via Q-LoRA) on the RefoMed dataset, and builds a French knowledge base from Wikipedia to ground the outputs. Through automatic and fine-grained human evaluation, the study shows that finetuning BIOMISTRAL within pRAGe improves the quality and correctness of short, patient-friendly paraphrases while keeping hallucination rates manageable, and that RAG grounding can outperform purely parametric LM memory under certain conditions. The approach offers a reproducible, open-source path to safer, grounded medical text generation for lay audiences, with practical implications for accessible patient education and downstream Q&A tasks.
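To make the retrieve-then-generate flow concrete, here is a minimal sketch in Python. It is not the authors' implementation: the three-entry knowledge base, the bag-of-words `embed` function (a stand-in for a neural encoder), the English prompt wording, and the `generate` callable (a stand-in for a small language model) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical three-entry knowledge base, for illustration only.
TOY_KB = [
    "L'hypertension est une pression artérielle trop élevée dans les artères.",
    "Le diabète est une maladie caractérisée par un excès de sucre dans le sang.",
    "Une fracture est une rupture d'un os.",
]

def embed(text):
    """Toy bag-of-words vector; a real pipeline would use a neural encoder."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, kb, k=1):
    """Return the k knowledge-base passages most similar to the query."""
    q = embed(query)
    ranked = sorted(kb, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(term, passages):
    """Hypothetical prompt wording; the paper's own template is in French."""
    context = "\n".join(passages)
    return (
        f"Context:\n{context}\n\n"
        f"Explain the medical term '{term}' in a few plain words:"
    )

def paraphrase(term, kb, generate, k=1):
    """Retrieve grounding passages, then hand the prompt to any text generator."""
    passages = retrieve(term, kb, k)
    return generate(build_prompt(term, passages))

if __name__ == "__main__":
    # An identity "model" that returns the prompt, so the grounding is visible.
    print(paraphrase("hypertension", TOY_KB, lambda prompt: prompt))
```

Passing the generator as a callable keeps retrieval and decoding separate, so the same retrieval step can be paired with a prompted or a finetuned model.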

Abstract

The recent surge in the accessibility of large language models (LLMs) to the general population can lead to untrackable use of such models for medical recommendations. Language generation with LLMs has two key problems: first, LLMs are prone to hallucination, so any medical use requires scientific and factual grounding; second, LLMs place a heavy demand on computational resources because of their very large size. In this work, we introduce pRAGe, a pipeline for Retrieval Augmented Generation and evaluation of medical paraphrase generation using Small Language Models (SLMs). We study the effectiveness of SLMs and the impact of an external knowledge base on medical paraphrase generation in French.

Paper Structure (34 sections, 1 equation, 3 figures, 9 tables)

Introduction
Related Work
Different types of RAG systems
Multilingual decoder models
Methodology
Datasets
RefoMed
In order to avoid bias in the LM's finetuning,
Descriptive statistics for the paraphrases in the RefoMed dataset.
RefoMed-KB
Automatic Evaluation
Evaluation metrics.
RAG$ref$S ($S$)
Fine-grained Human Evaluation
- readability:
...and 19 more sections

Figures (3)

Figure 1: Illustration of the pRAGe experimental pipeline. The illustration is intended to be read from left to right. Each colored arrow represents a process. The $\blacksquare$ arrow indicates the creation of the indexed database; the $\blacksquare$ arrow indicates the encoding of the query; the $\blacksquare$ arrow represents the retrieval of relevant documents; the $\blacksquare$ arrow denotes the generation of the simplified paraphrase output; and the $\blacksquare$ arrow indicates the evaluation step that produces the evaluation profile of the generated paraphrase.
Figure 2: Our prompt template in French for inference.
Figure 3: Correlation heatmap between automatic evaluation metrics (y-axis) and manual evaluation metrics (x-axis). The $\bigstar$ symbol denotes configurations with a finetuned SLM.
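
As a rough illustration of how a correlation grid like the one in Figure 3 could be computed, the sketch below correlates every automatic metric with every manual criterion. The choice of Pearson correlation is an assumption (the caption does not name the coefficient), and the function names and the dictionary layout are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (assumed coefficient)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # correlation is undefined for constant scores
    return cov / (sd_x * sd_y)

def correlation_table(automatic, manual):
    """Map each automatic metric (row) to its correlation with each manual one (column).

    Both arguments are dicts of metric name -> per-example scores, with the
    examples listed in the same order in every list.
    """
    return {
        auto_name: {
            manual_name: pearson(auto_scores, manual_scores)
            for manual_name, manual_scores in manual.items()
        }
        for auto_name, auto_scores in automatic.items()
    }
```

The nested dictionary has one row per automatic metric and one column per manual criterion, which matches the axis layout described in the caption and can be passed to any plotting library to draw the heatmap.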
