Rethinking Data: Towards Better Performing Domain-Specific Small Language Models

Boris Nazarov, Darya Frolova, Yackov Lubarsky, Alexei Gaissinski, Pavel Kisilev

TL;DR

This work tackles domain-specific multiple-choice question answering (MC-QA) with resource-efficient small LMs by building a modular pipeline that improves data quality, context selection, and generalization. It introduces a data-preparation strategy that converts documents into compact, semantically meaningful chunks and a Chunk Re-Ranker (CRR) that refines the retrieved context, all applied to a $2.5B$ Phi-2 model. Through cross-validated fine-tuning, CRR-based filtering, and weight-averaged model merging across data subsets, the approach reaches competitive MC-QA accuracy: $77\%$ on a public test set and $79.7\%$ on a private test set. The results indicate that careful data engineering and context management can enable high performance from domain-specific small LMs, offering a cost-effective alternative to LLMs for telecom knowledge tasks. The methods are relevant to deploying domain-focused LMs where computational efficiency and data quality are critical.
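The weight-averaged merging mentioned above can be illustrated with a minimal sketch. It assumes each fine-tuned model is represented as a mapping from parameter name to a flat list of floats; the name `average_weights` and the optional `coeffs` argument are hypothetical and are not taken from the paper, which may weight or select models differently.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical stand-in for a model checkpoint: parameter name -> flat weights.
StateDict = Dict[str, List[float]]


def average_weights(
    models: Sequence[StateDict],
    coeffs: Optional[Sequence[float]] = None,
) -> StateDict:
    """Merge checkpoints by a (normalized) weighted average of each parameter."""
    if not models:
        raise ValueError("at least one model is required")
    if coeffs is None:
        coeffs = [1.0] * len(models)  # plain uniform average
    if len(coeffs) != len(models):
        raise ValueError("one coefficient per model is required")
    total = sum(coeffs)
    if total == 0:
        raise ValueError("coefficients must not sum to zero")

    names = models[0].keys()
    for model in models[1:]:
        if model.keys() != names:
            raise ValueError("all models must share the same parameter names")

    merged: StateDict = {}
    for name in names:
        size = len(models[0][name])
        acc = [0.0] * size
        for model, coeff in zip(models, coeffs):
            if len(model[name]) != size:
                raise ValueError(f"shape mismatch for parameter {name!r}")
            for i, value in enumerate(model[name]):
                acc[i] += (coeff / total) * value
        merged[name] = acc
    return merged


# Two toy "models" fine-tuned on different data subsets.
fold_a = {"w": [1.0, 3.0]}
fold_b = {"w": [3.0, 5.0]}
merged = average_weights([fold_a, fold_b])
```

With real checkpoints the same loop would run over tensors rather than Python lists; the list form is used here only to keep the sketch dependency-free.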

Abstract

Fine-tuning Large Language Models (LLMs) on domain-specific data for downstream tasks has shown significant promise. However, commercial use of such LLMs is limited by the high computational cost of deploying them at scale. Small Language Models (LMs), on the other hand, are much more cost-effective but perform worse in a similar setup. This paper presents our approach to fine-tuning a small LM that reaches high accuracy on a multiple-choice question answering task. We achieve this by improving data quality at each stage of the LM training pipeline. In particular, we start with data structuring, which yields compact, semantically meaningful text chunks used by a retriever. This allows the LM to digest knowledge more efficiently. Further, we improve the retrieved context by training a lightweight Chunk Re-Ranker (CRR) that generates more accurate relative relevance scores for chunks. Finally, we improve the model's generalization ability by merging models fine-tuned with different parameters on different data subsets. We present detailed descriptions of each procedure, along with experimental findings that show the improvement contributed by each proposed technique.
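As a rough illustration of the re-ranking idea in the abstract, the sketch below re-scores retrieved chunks against the question and keeps the highest-scoring ones as context. The scoring function `toy_overlap_score` is a deliberately simple placeholder (word overlap) and is an assumption made for this example; the paper trains a lightweight language model for this role.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    question: str,
    chunks: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Re-score retrieved chunks and return the top_k (chunk, score) pairs."""
    scored = [(chunk, score_fn(question, chunk)) for chunk in chunks]
    # sorted() is stable, so ties keep the retriever's original order.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def toy_overlap_score(question: str, chunk: str) -> float:
    """Placeholder relevance score: fraction of question words found in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


retrieved = [
    "Billing records are stored daily",
    "The handover procedure moves a session between cells",
    "Handover is triggered by measurements",
]
context = rerank("what is the handover procedure", retrieved, toy_overlap_score, top_k=2)
```

Passing the scorer as `score_fn` keeps the selection logic independent of how relevance is computed, so a trained re-ranker could be substituted without changing `rerank`.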

Paper Structure

This paper contains 16 sections, 6 figures.

Figures (6)

  • Figure 1: Data example from TeleQnA dataset b14. Each data unit is represented in a standardized format comprising five distinct fields: [Question, Options, Answer, Explanation, Category]
  • Figure 2: Our method workflow. The pipeline consists of three main modules: 'Data pre-processing', 'CRR' and (SFT) 'Model'. In the 'Data pre-processing' module, document text is structured and cut into semantically meaningful chunks, followed by creation of the vector DB. The 'CRR' module generates ground truth for the subsequent training of the Chunk Re-Ranker Language Model. The rightmost 'Model' module performs pre-training and fine-tuning on the newly created data corpus and MC-QA set. During the inference stage (the bottom part of the scheme), the question is sent to the retriever to extract a set of chunks. Next, the CRR re-calculates each chunk's relevance score with respect to the question. The chunks with the highest relevance scores are then used as the context in the generation part of the workflow.
  • Figure 3: Data parsing example: left - the original unstructured text under the chapter title, the subtitles of a few subsequent chapters, followed by the original text chunk; right - the post-processed structure: each chunk is enriched with an added header comprising the document title, chapter title and subtitles.
  • Figure 4: Chunk size vs. number of chunks trade-off, and the golden chunk order: (a) Accuracy of the trained model as a function of the chunk size and the number of chunks used as the context prompt for the RAG (blue, red, green and magenta encode chunk sizes of 128, 192, 256 and 512 tokens respectively); (b) The number of 'golden' chunks vs. their corresponding IDs: the 'golden' chunk is usually retrieved first, but in many cases it appears in the 2nd, 3rd and even 7th position out of 7.
  • Figure 5: Influence of the context content on the accuracy of the model. The bar height encodes the model accuracy: no context (green), with 7 extracted standard chunks as the context (blue), and 7 extracted chunks using our text restructuring (orange). Prompting with the context based on our structured text chunks improves accuracy by up to 4% (which constitutes a substantial improvement relative to the data pre-processing cost). We used two embedding types (MS and SFR-2); both yield similar accuracy.
  • ...and 1 more figure
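The header enrichment described in the Figure 3 caption can be sketched as follows. This is an illustrative simplification: the function name `enrich_chunks`, the `[title > chapter > subtitle]` header layout, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for this example, and a fixed-size window is used in place of the paper's semantic chunking.

```python
from typing import List, Sequence


def enrich_chunks(
    doc_title: str,
    chapter_title: str,
    subtitles: Sequence[str],
    body: str,
    max_words: int = 128,
) -> List[str]:
    """Split body into windows of max_words and prefix each with a header."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    # Assumed header layout: document title, chapter title, then subtitles.
    header = " > ".join([doc_title, chapter_title, *subtitles])
    words = body.split()
    chunks: List[str] = []
    for start in range(0, len(words), max_words):
        piece = " ".join(words[start:start + max_words])
        chunks.append(f"[{header}]\n{piece}")
    return chunks


example = enrich_chunks(
    doc_title="Doc",
    chapter_title="Ch 1",
    subtitles=["1.1"],
    body="a b c d e",
    max_words=2,
)
```

Because every chunk carries its own header, a chunk retrieved in isolation still indicates which document and chapter it came from.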