
SIEVE: Sample-Efficient Parametric Learning from Natural Language

Parth Asawa, Alexandros G. Dimakis, Matei Zaharia

Abstract

Natural language context, such as instructions, knowledge, or feedback, contains rich signal for adapting language models. While in-context learning provides adaptation via the prompt, parametric learning persists into model weights and can improve performance further, though it is data hungry and relies heavily on either high-quality traces or automated verifiers. We propose SIEVE, a method for sample-efficient parametric learning from natural language context that requires as few as three query examples. SIEVE uses a novel synthetic data generation pipeline, SIEVE-GEN, that leverages the insight that context is decomposable. Decomposing context allows us to generate higher-quality rollouts by pairing synthetic queries with only the applicable context rather than the entire corpus, then using context distillation to internalize the context into the model. We evaluate in reasoning settings where context is necessary, including custom domains and the RuleArena and Machine Translation from One Book tasks. Our results show that SIEVE outperforms prior context distillation methods using just three query examples, demonstrating how to achieve sample-efficient parametric learning from natural language.
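
To make the pipeline concrete, the sketch below outlines a SIEVE-GEN-style data generation loop in Python: the context corpus is decomposed into sections, a synthetic query is generated per section conditioned on the seed queries, and each query is paired with only its applicable section. The decomposition heuristic, the prompt wording, and the `generate` helper (a stand-in for any LLM completion call) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a SIEVE-GEN-style data generation loop.
# `generate(prompt)` is a hypothetical stand-in for any LLM completion call.
from typing import Callable, List, Tuple

def sieve_gen(
    context_corpus: str,
    seed_queries: List[str],          # as few as three real query examples
    generate: Callable[[str], str],   # hypothetical LLM call, e.g. an API wrapper
    n_synthetic: int = 8000,
) -> List[Tuple[str, str]]:
    """Return (synthetic query, applicable context) pairs for context distillation."""
    # 1. Decompose the context into smaller units (here: a naive paragraph split).
    sections = [s for s in context_corpus.split("\n\n") if s.strip()]

    pairs: List[Tuple[str, str]] = []
    while len(pairs) < n_synthetic:
        for section in sections:
            # 2. Generate a synthetic query grounded in one section, conditioned
            #    on the seed queries so the style matches real usage.
            query = generate(
                "Here are example user queries:\n"
                + "\n".join(seed_queries)
                + "\n\nWrite one new query answerable using only this context:\n"
                + section
            )
            # 3. Pair the query with only its applicable context, not the full
            #    corpus; a teacher later produces rollouts conditioned on this pair.
            pairs.append((query, section))
            if len(pairs) >= n_synthetic:
                break
    return pairs
```

Pairing each synthetic query with the section that produced it is one simple way to realize "only the applicable context"; the paper's selection step may differ.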

Figures (4)

  • Figure 1: Sieve system overview. Given a natural language context corpus and as few as 3 seed query examples, Sieve-Gen generates synthetic training data composed of (query, applicable context) pairs. These pairs are used for context distillation, where a student model learns to match a teacher's distribution conditioned on applicable context, internalizing the knowledge into weights for inference without context (a minimal sketch of this distillation objective follows the figure list).
  • Figure 2: Sieve improves with scale while real data input is constant. Across various domains, Sieve improves as we scale the amount of data we generate with Sieve-Gen (using the same fixed three example queries as inputs), approximately matching or exceeding ICL baseline performance when evaluated without any context. All domains use the Qwen3-8B model family with thinking disabled.
  • Figure 3: Comparison to baseline context distillation methods. We compare Sieve against vanilla context distillation baselines across domains. $V_{CD}$ (3 seeds) trains on only the three seed query examples with all context. $V_{CD-S}$ (8K) uses our synthetically generated queries but includes all context during rollout generation (no selective filtering). Sieve scales synthetic data from three seeds up to 8K/16K examples and outperforms the baselines given the same amount of training data in all scenarios.
  • Figure 4: Sieve generalizes across model families. We evaluate Sieve on the Retail domain using alternative model families: Llama 3.1 8B and Rnj 1 8B. Results demonstrate that Sieve consistently improves model performance across diverse architectures (8K training examples).
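
The distillation step referenced in Figure 1 can be read as a standard teacher-student objective: the teacher is conditioned on the applicable context plus the query, the student sees only the query, and the student is trained to match the teacher's token distribution over the rollout. The snippet below is a minimal PyTorch sketch under those assumptions, using Hugging Face-style causal LMs that return `.logits`; the paper's exact loss, batching, and tokenization details are not specified here.

```python
# Minimal PyTorch sketch of a context distillation objective as described in
# Figure 1. The KL formulation and tokenization details are assumptions, not
# the paper's exact recipe.
import torch
import torch.nn.functional as F

def context_distillation_loss(student, teacher, tokenizer,
                              query, applicable_context, response_text):
    """Student (query only) matches the teacher (applicable context + query) on rollout tokens."""
    response_ids = tokenizer(response_text, return_tensors="pt").input_ids
    n = response_ids.shape[-1]

    def response_logits(model, prefix_text):
        prefix_ids = tokenizer(prefix_text, return_tensors="pt").input_ids
        full = torch.cat([prefix_ids, response_ids], dim=-1)
        logits = model(full).logits
        # With next-token prediction, positions -(n+1) .. -2 predict the n response tokens.
        return logits[:, -n - 1:-1, :]

    with torch.no_grad():
        teacher_logits = response_logits(teacher, applicable_context + "\n\n" + query)
    student_logits = response_logits(student, query)

    # Forward KL from the teacher's distribution to the student's, over the rollout tokens.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
```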