Improving Latent Generalization Using Test-time Compute

Arslan Chaudhry, Sridhar Thiagarajan, Andrew Lampinen

Abstract

Language Models (LMs) exhibit two distinct mechanisms for knowledge acquisition: in-weights learning (i.e., encoding information within the model weights) and in-context learning (ICL). Although these two modes offer complementary strengths, in-weights learning frequently struggles to support deductive reasoning over the internalized knowledge. We characterize this limitation as a deficit in latent generalization, of which the reversal curse is one example. Conversely, in-context learning demonstrates highly robust latent generalization. To improve latent generalization from in-weights knowledge, prior approaches rely on train-time data augmentation, yet these techniques are task-specific, scale poorly, and fail to generalize to out-of-distribution knowledge. To overcome these shortcomings, this work studies how models can be taught to use test-time compute, or 'thinking', specifically to improve latent generalization. We use Reinforcement Learning (RL) from correctness feedback to train models to produce long chains-of-thought (CoTs) that improve latent generalization. Our experiments show that this thinking approach not only resolves many instances of latent generalization failure on in-distribution knowledge but also, unlike augmentation baselines, generalizes to new knowledge on which no RL was performed. Nevertheless, on pure reversal tasks, we find that thinking does not unlock direct knowledge inversion; the generate-and-verify ability of thinking models nonetheless lets them perform well above chance. Because factual self-verification is brittle, thinking models still fall well below in-context learning on this task. Overall, our results establish test-time thinking as a flexible and promising direction for improving the latent generalization of LMs.
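To make the training signal concrete, the following is a minimal sketch of RL from correctness feedback as described above. The "Answer:" marker convention, the function names, and the binary reward shape are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a correctness-based RL reward for chain-of-thought
# rollouts. The "Answer:" marker convention, helper names, and reward
# shape are assumptions for illustration, not the authors' implementation.

def extract_final_answer(output: str, marker: str = "Answer:") -> str:
    """Return the text after the last answer marker (assumed convention)."""
    idx = output.rfind(marker)
    return output[idx + len(marker):].strip() if idx != -1 else output.strip()

def correctness_reward(sampled_output: str, gold_answer: str) -> float:
    """Binary reward: 1.0 iff the final answer matches the gold answer."""
    return float(extract_final_answer(sampled_output).lower()
                 == gold_answer.strip().lower())

# Usage: score one sampled chain-of-thought rollout against its gold label.
rollout = "B is the parent of A, so A is the child of B. Answer: child"
print(correctness_reward(rollout, "child"))  # 1.0
```

Under this kind of reward, the policy is free to spend arbitrary test-time compute in the CoT, since only the correctness of the final answer is scored.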

Figures (9)

  • Figure 1: Comparison between train-time augmentation strategies vs test-time thinking to improve latent generalization.
  • Figure 2: Semantic Structure Experiment: F1 scores on in- and out-of-distribution holdouts. Both the Finetuning and Augmentation baselines undergo RL training as described in the baselines section. Thinking shows significant and consistent improvement over the baselines on both in- and out-of-distribution holdout splits, particularly on splits that require multi-hop reasoning. Train-time augmentations completely fail to generalize to new knowledge structures (out-of-distribution generalization).
  • Figure 3: Pass@N Accuracy: Exact string-match accuracy on the dataset from the reversal-curse paper. Finetuning completely fails to generalize even with multiple attempts. Thinking yields non-trivial generalization, but for such strict reversals ICL remains the strongest baseline (a minimal sketch of the pass@N metric follows this list).
  • Figure 4: Semantic Structure Benchmark: Train-set examples of Wikipedia-style documents and QA pairs (left) and a test-set example (right).
  • Figure 5: Effect of RL (without CoT reasoning) on the baselines: All-F1 (top row) and Train Recall (bottom row).
  • ...and 4 more figures
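As referenced in the Figure 3 caption, here is a hedged sketch of the pass@N exact-match metric: a question counts as solved if any of its N sampled attempts exactly matches the gold string. The interface and names are assumptions for illustration only.

```python
# A sketch of pass@N with exact string match (as plotted in Figure 3):
# a question is solved if any of its N sampled attempts matches the gold
# answer exactly. Interface and names are illustrative assumptions.

from typing import Sequence

def pass_at_n(attempts: Sequence[str], gold: str) -> bool:
    """True if any attempt is an exact string match with the gold answer."""
    return any(a.strip() == gold.strip() for a in attempts)

def pass_at_n_accuracy(all_attempts: Sequence[Sequence[str]],
                       golds: Sequence[str]) -> float:
    """Fraction of questions solved by at least one of their N attempts."""
    solved = sum(pass_at_n(atts, g) for atts, g in zip(all_attempts, golds))
    return solved / len(golds)

# Example: two questions, N = 3 attempts each; the second one is missed.
print(pass_at_n_accuracy(
    [["Bonn", "Berlin", "Bonn"], ["Paris", "Lyon", "Nice"]],
    ["Berlin", "Marseille"]))  # 0.5
```

This metric separates a model's ability to ever produce the reversed fact (sampling) from its ability to do so reliably, which is why it is a natural lens on the generate-and-verify behavior discussed in the abstract.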