Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Sanket Badhe; Deep Shah

Abstract

Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability and introduces significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD): we extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's system prompt. Evaluated on the StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 from 57% to 90.0% and from 67% to 83%, respectively, enabling this compact model to match frontier performance with negligible latency overhead. Because the instructions are explicit, the decision-making process is transparent and its logic can be fully verified by a human. This makes the approach well suited to regulated industries such as law, finance, and content moderation, as well as to high-volume use cases and edge devices.
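
As a rough illustration of the idea in the abstract, the sketch below turns teacher-derived rules into a single system prompt for a student model. It is a minimal sketch under stated assumptions, not the authors' implementation: `toy_teacher` is a hypothetical stand-in for a Teacher LLM call, the example clauses and rules are invented, exact-match deduplication is swapped in for the semantic clustering the paper describes, and the conflict-resolution phase is omitted entirely.

```python
from collections import defaultdict


def toy_teacher(text: str, label: str) -> str:
    """Hypothetical stand-in for a Teacher LLM call: returns a canned rule per label."""
    canned = {
        "entailment": "If the clause states the obligation explicitly, answer entailment.",
        "contradiction": "If the clause forbids what the hypothesis claims, answer contradiction.",
    }
    return canned[label]


def extract_instructions(examples, teacher):
    """Ask the teacher for one explicit rule per labeled training example."""
    return [(label, teacher(text, label)) for text, label in examples]


def consolidate(pairs):
    """Group rules by label and drop duplicates (case- and whitespace-insensitive).

    Assumption: this simple deduplication replaces semantic clustering.
    """
    grouped = defaultdict(list)
    seen = set()
    for label, rule in pairs:
        key = (label, " ".join(rule.lower().split()))
        if key not in seen:
            seen.add(key)
            grouped[label].append(rule.strip())
    return dict(grouped)


def build_system_prompt(grouped):
    """Render the consolidated rules as a structured list for the student's system prompt."""
    lines = ["Classify the input using the rules below."]
    for label in sorted(grouped):
        lines.append(f"[{label}]")
        lines.extend(f"{i}. {rule}" for i, rule in enumerate(grouped[label], 1))
    return "\n".join(lines)


# Invented toy examples, for illustration only.
examples = [
    ("The receiving party shall keep all information confidential.", "entailment"),
    ("The receiving party shall not disclose any information.", "entailment"),
    ("The receiving party may not retain copies.", "contradiction"),
]
grouped = consolidate(extract_instructions(examples, toy_teacher))
system_prompt = build_system_prompt(grouped)
print(system_prompt)
```

The resulting `system_prompt` string would be supplied once to the student model, so inference needs no per-query reasoning trace, and every rule in it can be read and audited directly.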

Paper Structure (31 sections, 3 figures, 2 tables)

Introduction
Related Work
Chain-of-Thought and Inference Efficiency
Knowledge Distillation and Reasoning Transfer
Automatic Prompt Optimization (APO)
Methodology
Phase 1: Supervised Instruction Extraction
Phase 2: Clustering Logic Synthesis
Phase 3: Conflict Resolution
Phase 4: Inference
Experimental Setup
Setup
Baseline
Datasets
Results and Analysis
...and 16 more sections

Figures (3)

Figure 1: Overview of Prompt-Level Distillation (PLD). The framework operates in four phases: (1) Supervised Instruction Extraction from training data, (2) Semantic Synthesis using clustering, (3) A Closed-Loop Conflict Resolution phase to refine logic, and (4) Zero-shot Inference using the consolidated system prompt.
Figure 2: End-to-end (E2E) model latency for Gemma-3 4B, Gemini 2 Flash, and Gemini 3 Flash
Figure 3: Comparison of model pricing
