LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering

Patrick Sutanto, Joan Santoso, Esther Irawati Setiawan, Aji Prasetya Wibawa

TL;DR

The paper addresses efficient few-shot MCQA by distilling knowledge from large language models (LLMs) into a compact encoder. It uses LLMs to generate synthetic MCQA data and to assign per-choice soft labels, which then guide the training of a DeBERTa-v3-base encoder through a distillation loss. It evaluates two data-generation strategies, direct JSON generation and a decomposed three-stage pipeline, and shows that data generation plus distillation yields substantial improvements on the MMLU benchmark, approaching the performance of larger, instruction-tuned models while using far fewer parameters. The results demonstrate the practical potential of combining LLM-generated data with soft-label distillation for resource-efficient few-shot MCQA, and point to future work on improving data quality, refining distillation techniques, and extending the approach to longer-context tasks.

Abstract

Multiple Choice Question Answering (MCQA) is an important problem with numerous real-world applications, such as medicine, law, and education. The high cost of building MCQA datasets makes few-shot learning pivotal in this domain. While Large Language Models (LLMs) can enable few-shot learning, their direct application in real-world scenarios is often hindered by their high computational cost. To address this challenge, we propose a simple yet effective approach that uses LLMs for data generation and scoring. Our approach uses LLMs to create MCQA data consisting of questions and choices, and to assign probability scores to the generated choices. We then use the generated data and the LLM-assigned scores to finetune a smaller and more efficient encoder-only model, DeBERTa-v3-base, with a distillation loss. Extensive experiments on the Massive Multitask Language Understanding (MMLU) benchmark demonstrate that our method improves accuracy from 28.9% to 39.3%, a gain of over 10 percentage points compared to a baseline finetuned directly on 5-shot examples. This shows the effectiveness of LLM-driven data generation and knowledge distillation for few-shot MCQA.
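
To make the idea of training on LLM-assigned scores concrete, here is a minimal sketch of one common form of distillation loss: the KL divergence between the teacher's per-choice probabilities and the student's softmax over its per-choice logits. The abstract does not state the exact loss formula, so the choice of KL divergence, the function names, and the toy numbers below are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of per-choice logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits, teacher_probs):
    """KL(teacher || student) for one question (an assumed loss form).

    teacher_probs: LLM-assigned probabilities, one per choice, summing to 1.
    student_logits: unnormalized scores from the small encoder, one per choice.
    """
    log_q = log_softmax(student_logits)
    # Choices with zero teacher probability contribute nothing to the sum.
    return sum(
        p * (math.log(p) - lq)
        for p, lq in zip(teacher_probs, log_q)
        if p > 0.0
    )


# Hypothetical soft labels for a four-choice question.
teacher_probs = [0.70, 0.15, 0.10, 0.05]

# Two hypothetical students: one roughly agrees with the teacher, one does not.
agreeing_student = [2.0, 0.5, 0.0, -0.5]
disagreeing_student = [-0.5, 0.0, 0.5, 2.0]

print(distillation_loss(agreeing_student, teacher_probs))
print(distillation_loss(disagreeing_student, teacher_probs))
```

Unlike a one-hot label, the soft label also tells the student how plausible each wrong choice is, so a student that ranks the distractors the way the teacher does is penalized less than one that merely picks the same top answer.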

Paper Structure

This paper contains 34 sections, 5 equations, 4 figures, 30 tables.

Figures (4)

  • Figure 1: Framework for Few-Shot MCQA using LLM-Generated Data and Distillation.
  • Figure 2: Effect of Generated Data Size on Few-Shot MCQA Accuracy. The figure compares the performance of DeBERTa-v3-base trained on varying amounts of generated data (using both the JSON and Decompose methods), with and without LLM distillation, against a baseline trained on real data from the ARC-Easy (a) and ARC-Challenge (b) datasets.
  • Figure 3: Average Maximum Cosine Similarity between Generated Questions and the Training/Test Sets on ARC-Easy and ARC-Challenge. Similarity is calculated between question embeddings, excluding choices.
  • Figure 4: Maximum Cosine Similarity Observed between Generated Questions and the Training/Test Sets on ARC-Easy and ARC-Challenge. Similarity is calculated between question embeddings, excluding choices.
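
The two similarity statistics named in Figures 3 and 4 can be sketched as follows. For each generated question, take its highest cosine similarity to any question in a reference set (training or test); Figure 3 then reports the average of those per-question maxima, and Figure 4 the largest value observed. The embedding model is not named in this section, so the two-dimensional vectors and the function names below are hypothetical placeholders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_similarities(generated, reference):
    """For each generated embedding, its best match in the reference set."""
    return [max(cosine(g, r) for r in reference) for g in generated]


def average_max_similarity(generated, reference):
    """Mean of the per-question maxima (the statistic described for Figure 3)."""
    sims = max_similarities(generated, reference)
    return sum(sims) / len(sims)


def overall_max_similarity(generated, reference):
    """Largest similarity over all pairs (the statistic described for Figure 4)."""
    return max(max_similarities(generated, reference))


# Toy question embeddings; real ones would come from a sentence encoder.
generated = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0]]

print(average_max_similarity(generated, reference))
print(overall_max_similarity(generated, reference))
```

A high overall maximum with a low average would suggest a few near-duplicates of benchmark questions among otherwise novel generated data, which is why reporting both statistics is informative.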