Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification

Mikołaj Langner, Jan Eliasz, Ewa Rudnicka, Jan Kocoń

TL;DR

This paper addresses efficient multi-label text classification in settings where the label space evolves over time. It introduces a dichotomic prompting framework that treats each label as an independent yes/no decision and leverages prefix caching to accelerate inference on decoder-only LLMs, with a distillation pipeline to train small language models from a high-capacity teacher. The authors show that dichotomic prompting achieves comparable accuracy to structured JSON prompts and yields superior zero-shot robustness for unseen labels, while delivering substantial speedups on short texts. They demonstrate these gains on a 10k Polish affective dataset covering $K=24$ dimensions, with four small models (HerBERT-Large, PLLuM-8B, CLARIN-1B, Gemma3-1B) trained via DeepSeek-V3 pseudo-labels. The framework is generalizable to other domains and languages, offering a scalable, cost-efficient solution for dynamic multi-label classification.

Abstract

We introduce a method for efficient multi-label text classification with large language models (LLMs), built on reformulating classification tasks as sequences of dichotomic (yes/no) decisions. Instead of generating all labels in a single structured response, each target dimension is queried independently, which, combined with a prefix caching mechanism, yields substantial efficiency gains for short-text inference without loss of accuracy. To demonstrate the approach, we focus on affective text analysis, covering 24 dimensions including emotions and sentiment. Using LLM-to-SLM distillation, a powerful annotator model (DeepSeek-V3) provides multiple annotations per text, which are aggregated to fine-tune smaller models (HerBERT-Large, CLARIN-1B, PLLuM-8B, Gemma3-1B). The fine-tuned models show significant improvements over zero-shot baselines, particularly on the dimensions seen during training. Our findings suggest that decomposing multi-label classification into dichotomic queries, combined with distillation and cache-aware inference, offers a scalable and effective framework for LLM-based classification. While we validate the method on affective states, the approach is general and applicable across domains.
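The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes a prompt layout in which the instruction and the input text form a shared prefix and only the label question varies; the instruction wording, the question template, and the function name `build_dichotomic_prompts` are hypothetical and not taken from the paper. The sketch only builds the prompts; whether the shared prefix is actually reused depends on the inference engine's prefix-caching support.

```python
from os.path import commonprefix

# Hypothetical instruction; the paper's actual prompt wording is not shown here.
INSTRUCTION = "Answer with 'yes' or 'no' only."


def build_dichotomic_prompts(text, labels, instruction=INSTRUCTION):
    """Return the shared prefix and one yes/no prompt per label.

    The instruction and the input text are placed first so that every
    prompt for the same text begins with the same string, which is the
    part a prefix cache could reuse across the per-label queries.
    """
    prefix = f"{instruction}\n\nText: {text}\n\n"
    prompts = [f"{prefix}Does the text express {label}? Answer:" for label in labels]
    return prefix, prompts


labels = ["joy", "anger", "sadness"]  # illustrative subset of the 24 dimensions
prefix, prompts = build_dichotomic_prompts("What a wonderful day!", labels)

# The longest common prefix of all prompts covers at least instruction + text.
shared = commonprefix(prompts)
print(f"{len(prompts)} prompts, {len(shared)} shared leading characters")
```

Adding a new label under this layout only adds one more short suffix; the prefix for a given text is unchanged, which is why the approach suits label spaces that evolve over time.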

Paper Structure

This paper contains 21 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Annotation and distillation pipeline. Raw texts are independently annotated via three DeepSeek-V3 passes. The outputs are aggregated (via majority vote) to form consensus pseudo-labels. These annotations are then verified by human annotators for reliability assessment and used to fine-tune small language models (SLMs), which are subsequently evaluated.
  • Figure 2: Prompting strategies for multi-label classification. Top: Structured JSON prompting produces all label predictions in a single response by filling a predefined JSON schema. Bottom: Dichotomic prompting issues one binary question per label and collects individual yes/no answers. Both strategies share the same input text but differ in how labels are queried and outputs are structured.
  • Figure 3: Prompt structure configurations for evaluating prefix caching efficiency. Each column (Case 1–3) represents a different arrangement of prompt components: Instruction, Text, and Target Label (Dimension). Green boxes denote cacheable segments reused across prompts; gray boxes indicate uncached, recomputed parts; and orange boxes highlight the position of the queried affective label. Case 3 maximizes cache utilization by placing both the instruction and input text in the shared prefix.
  • Figure 4: Evaluation protocol for all settings. Solid cyan path: Fine-tuning and evaluation on all labels. Dashed red path: Leave-one-out (LOO) strategy where one label is excluded from training and evaluated separately. Dotted gray path: Zero-shot evaluation using the base model without fine-tuning.
  • Figure 5: Relationship between label frequency and annotation agreement. Blue bars show the percentage of positive annotations per label, while dashed lines indicate Positive Specific Agreement (PSA): intra-model consistency (red), annotator A1 vs. model (purple), and annotator A2 vs. model (orange). Reported Spearman correlations (top left) quantify the relationship between label prevalence and PSA, showing that more frequent labels tend to yield higher agreement.
  • ...and 3 more figures
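The aggregation step in Figure 1 and the agreement measure in Figure 5 can be sketched as follows. This is a minimal illustration, assuming each annotation pass is encoded as a list of binary (0/1) decisions for one label across texts; the function names and the toy data are hypothetical. Positive Specific Agreement is computed with its standard definition, 2a / (2a + b + c), where a counts items both raters mark positive and b and c count the items only one of them marks positive; the paper's exact computation is not shown in this section.

```python
def majority_vote(passes):
    """Aggregate several binary annotation passes into consensus labels.

    `passes` is a list of equally long 0/1 lists, one per annotation pass.
    A label is positive when more than half of the passes say so.
    """
    n = len(passes)
    return [int(2 * sum(votes) > n) for votes in zip(*passes)]


def positive_specific_agreement(rater_a, rater_b):
    """Standard PSA: 2a / (2a + b + c); NaN when neither rater has positives."""
    both_positive = sum(1 for x, y in zip(rater_a, rater_b) if x == 1 and y == 1)
    disagreements = sum(1 for x, y in zip(rater_a, rater_b) if x != y)
    denominator = 2 * both_positive + disagreements
    return 2 * both_positive / denominator if denominator else float("nan")


# Toy data: three passes over four texts for a single label.
passes = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
consensus = majority_vote(passes)

# Hypothetical human annotation of the same four texts.
human = [1, 0, 0, 0]
psa = positive_specific_agreement(consensus, human)
print(consensus, round(psa, 3))
```

PSA ignores items that both raters mark negative, which makes it better suited than raw accuracy to sparse labels, where joint negatives dominate.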