ColBERT Retrieval and Ensemble Response Scoring for Language Model Question Answering

Alex Gichamba, Tewodros Kederalah Idris, Brian Ebiyau, Eric Nyberg, Teruko Mitamura

TL;DR

The paper tackles domain-specific QA for telecom standards by combining ColBERT-based retrieval, lexicon augmentation, and model-specific tuning for two small language models, Phi-2 and Falcon-7B. Phi-2 is fine-tuned with LoRA and given a retrieval-augmented prompt, while Falcon-7B uses a prompt-only strategy plus a text-entailment-oriented scoring module. Across in-domain TeleQnA-like tasks and out-of-domain pharmacology data, the approach delivers strong results (81.9% accuracy for Phi-2; 57.3% for Falcon-7B) and shows that retrieval quality and context management are critical for performance. The work demonstrates that dense retrieval and domain lexicons can substantially improve smaller language models in knowledge-intensive domains. The code and fine-tuned models are publicly released to support further research and deployment.

Abstract

Domain-specific question answering remains challenging for language models, given the deep technical knowledge required to answer questions correctly. This difficulty is amplified for smaller language models, which cannot encode as much information in their parameters as larger models. The "Specializing Large Language Models for Telecom Networks" challenge aimed to enhance the performance of two small language models, Phi-2 and Falcon-7B, in telecommunication question answering. In this paper, we present our question answering systems for this challenge. Our solutions achieved leading marks of 81.9% accuracy for Phi-2 and 57.3% for Falcon-7B. We have publicly released our code and fine-tuned models.

Paper Structure (28 sections, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Both our QA systems feature a ColBERT-based retrieval pipeline and lexicon enhancement. Phi-2 is fine-tuned for instruction alignment and to invoke better reasoning. * The Falcon-7B prompt doesn't include the options. Since the responses are not conditioned on the options, we evaluated them using an ensemble scoring system to find the most likely option.
  • Figure 2: Accuracy as a function of the number of chunks ($k$) for different chunk sizes (CS). The number of chunks was incrementally increased beyond the context window length, leading to sharp declines for larger values of $k$.
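
The ensemble scoring idea described in the Figure 1 caption, where a free-form response is matched to the most likely option after generation, can be illustrated with a minimal sketch. The scorers actually used in the paper are not specified in this section, so the two lexical scorers below (token-overlap F1 and character-level similarity), the equal weights, and the names `overlap_f1`, `char_similarity`, and `ensemble_pick` are all assumptions made for illustration; they are stand-ins and do not perform textual entailment.

```python
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (t.strip(".,;:()") for t in text.lower().split())
    return [t for t in stripped if t]


def overlap_f1(response, option):
    """Set-based token F1 between response and option (assumed scorer 1)."""
    r, o = set(tokens(response)), set(tokens(option))
    common = len(r & o)
    if common == 0:
        return 0.0
    precision = common / len(o)
    recall = common / len(r)
    return 2 * precision * recall / (precision + recall)


def char_similarity(response, option):
    """Character-level similarity ratio in [0, 1] (assumed scorer 2)."""
    return SequenceMatcher(None, response.lower(), option.lower()).ratio()


def ensemble_pick(response, options, weights=(0.5, 0.5)):
    """Return (index of best option, per-option weighted scores).

    The weights are hypothetical; a real system would tune them.
    """
    scorers = (overlap_f1, char_similarity)
    totals = [
        sum(w * scorer(response, option) for w, scorer in zip(weights, scorers))
        for option in options
    ]
    best = max(range(len(options)), key=totals.__getitem__)
    return best, totals


if __name__ == "__main__":
    # Hypothetical response and options, not taken from the paper's data.
    response = "The AMF handles registration and mobility management."
    options = [
        "Session management",
        "Registration and mobility management",
        "User plane forwarding",
        "Policy control",
    ]
    best, totals = ensemble_pick(response, options)
    print(best, [round(t, 3) for t in totals])
```

Combining several scorers in a weighted sum means no single noisy signal decides the answer on its own. Swapping one of the stand-ins for an entailment model would only require a function with the same `(response, option) -> float` shape.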