Complexity-aware fine-tuning

Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev

TL;DR

This work tackles the challenge of efficiently fine-tuning domain-adapted LLMs under resource constraints by introducing a complexity-aware pipeline that uses single-token answer entropy to split data into regular and hard categories. Regular data receive standard supervised fine-tuning, while hard data are fine-tuned on chain-of-thought distilled from a larger model, so that reasoning is applied only where it is needed. Across two open 3B-scale models and the MMLU-Pro benchmark, the approach outperforms standard SFT and curriculum baselines and matches distillation performance while using up to $81\%$ less data. The study also conducts sensitivity analyses on alternative complexity metrics and highlights the practical viability of entropy-based complexity signals for data curation and efficient fine-tuning.

Abstract

General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models ($\sim 3$B) we split the training data into complexity categories by a single-token answer entropy (ROC AUC $0.73$), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach ($0.58$ vs $0.45$ average accuracy) and outperforms the distillation approach ($0.58$ vs $0.56$ average accuracy) while using $81\%$ less data.
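
The entropy-based split described in the abstract can be illustrated with a minimal sketch. It assumes that, for each multiple-choice question, the student model's logits over the answer-option tokens are already available; the function names, the toy logits, and the threshold value are hypothetical and are not taken from the paper, which may compute and bin the entropy differently.

```python
import math


def answer_entropy(option_logits):
    """Shannon entropy (in nats) of the softmax over answer-option logits."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(option_logits)
    exps = [math.exp(x - m) for x in option_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_entropy(examples, threshold):
    """Route examples into (regular, hard) by comparing entropy to a threshold.

    `threshold` is a hypothetical hyperparameter chosen for illustration.
    """
    regular, hard = [], []
    for ex in examples:
        if answer_entropy(ex["logits"]) > threshold:
            hard.append(ex)      # uncertain answer: candidate for teacher reasoning
        else:
            regular.append(ex)   # confident answer: candidate for direct SFT
    return regular, hard


# Toy inputs (made up): one confident prediction, one uniform prediction.
examples = [
    {"id": "q1", "logits": [8.0, 0.0, 0.0, 0.0]},
    {"id": "q2", "logits": [1.0, 1.0, 1.0, 1.0]},
]
regular, hard = split_by_entropy(examples, threshold=0.5)
```

In this sketch the confident question lands in the regular bucket and the uniformly uncertain one in the hard bucket; in practice the threshold would have to be tuned on held-out data.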

Paper Structure

This paper contains 49 sections, 8 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Complexity-aware fine-tuning scheme for a student LLM: we identify complexity of questions via uncertainty estimation of a model (Step 1), then for questions of regular complexity we apply direct SFT (Step 2), while for hard questions we include reasoning from a teacher LLM (Step 3) and complete SFT using reasoning-enriched hard data (Step 4).
  • Figure 2: Data aggregation: the data can be further split into chunks of varying complexity.
  • Figure 3: SFT quality dynamics during training with split by complexity estimates provided by the MASJ reasoning score and the single token entropy across Phi-4-mini and Qwen 3B models.
  • Figure 4: Pipeline complexity metric performance comparison for entropy and cross-entropy across Phi-4-mini and Qwen 3B models.
  • Figure 5: Curriculum learning accuracy dynamics for Qwen 3B (left) and Phi-4-mini (right).
  • ...and 2 more figures