Reasoning-Intensive Regression

Diane Tchuindjo, Omar Khattab

TL;DR

This work introduces Reasoning-Intensive Regression (RiR), a regime where scoring text requires deep, structured reasoning while task-specific data is scarce. It benchmarks RiR with four tasks (Mathematical Error Detection, Instruction Following, Pairwise RAG Comparison, and Essay Grading) and shows that both prompting frozen LLMs and finetuning encoders struggle to deliver precise, well-calibrated scores. To address this, the authors propose MENTAT, a lightweight method that iteratively evolves prompts based on error analysis and learns a neural aggregator over multiple LLM rollouts, achieving up to 65% improvement over both baselines. The results highlight persistent challenges such as output quantization and rollout variance, and they point to future work on efficient RiR methods and robust benchmarking for practical, low-resource settings.

Abstract

AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks, e.g., for sentiment or similarity, RiR often appears instead in ad-hoc problems such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.

Paper Structure

This paper contains 52 sections, 16 figures, 2 tables.

Figures (16)

  • Figure 1: On regression for detecting the first math error, finetuning a NeoBERT model collapses to mean predictions (CCC = 0.01). Meanwhile, detailed (human-crafted) prompting achieves reasonable concordance (CCC = 0.69) but exhibits coarse and imprecise prediction behavior (the dense horizontal lines and near-random NMSE). MENTAT's performance illustrates how RiR problems benefit from combining deep reasoning capabilities with precise numerical predictions.
  • Figure 2: Inspired by the analysis of retrieval tasks in [reasoning-intensive-retrieval], we break down text-based regression problems into three informal complexity levels. Level 1 tasks use simple feature-based inputs (for example, the number of bedrooms and bathrooms when predicting home prices). Text-to-text regression achieves strong Level 1 performance with rich datasets [text-to-text-regression]. Level 2 tasks require moderate semantic understanding (sentiment analysis, reward modeling) but are easy for supervised learning over a pretrained Transformer. Level 3, the focus of this work, represents Reasoning-Intensive Regression (RiR), which requires deep sequential reasoning.
  • Figure 3: Ground-truth score distributions for mathematical error detection (the spread capturing the tendency for solutions to fail towards the center), instruction following (capturing the tendency to favor the tails), pairwise RAG comparison (narrow distribution around averaged judgments), and essay grading (tight clustering characteristic of qualitative assessments).
  • Figure 4: Phase 1 performs prompt evolution through iterative, batched reflection. Given a candidate prompt, we collect rollouts on $n$ samples, divided into training and validation sets. A model instructed to focus on the $\sqrt{n}$ worst-performing examples (selected by absolute prediction error) analyzes the rollouts on the training samples, in light of the optimization history from previous iterations, and proposes refinements to the prompt. This cycle continues for a predetermined number of iterations, after which we select the best-performing prompt $P_{\text{best}}$, i.e., the prompt that achieves the highest CCC on the validation set. Phase 2 applies $P_{\text{best}}$ to sample $K$ stochastic predictions (rollouts) per input, then trains a neural aggregator $f_\theta$ on the sorted rollouts using a combined CCC--NMSE loss. Test predictions are obtained by sampling test rollouts and applying the trained aggregator $f_{\theta^{\star}}$.
  • Figure 5: Distribution of per-question rollout variances comparing the Detailed (human-crafted) prompt against the MENTAT-evolved prompt across three tasks. For reasoning-intensive tasks (Mathematical Error Detection and Pairwise RAG Comparison), MENTAT's prompt evolution yields lower mean rollout variance, indicating more consistent predictions across independent rollouts. In contrast, Essay Grading, which is characterized as a Level 2 (semantic analysis) task requiring less sequential reasoning, shows comparable variance between prompts. This pattern suggests that prompt evolution yields the greatest consistency gains on tasks where deep reasoning is essential, while contributing less when shallow semantic features suffice.
  • ...and 11 more figures
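
The quantities named in the Figure 1 and Figure 4 captions can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation. CCC is computed with the standard definition of Lin's concordance correlation coefficient. NMSE is assumed here to be the mean squared error divided by the variance of the ground truth. Rounding $\sqrt{n}$ up, the weight `lam`, and the form of `combined_loss` are also assumptions; the captions do not specify them. All function names are hypothetical.

```python
import math
import numpy as np


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient (population moments)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    denom = y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2
    return 0.0 if denom == 0 else float(2 * cov / denom)


def nmse(y_true, y_pred):
    """MSE divided by the ground-truth variance (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2) / y_true.var())


def worst_examples(y_true, y_pred):
    """Indices of the ceil(sqrt(n)) examples with the largest absolute error.

    Rounding sqrt(n) up is an assumption; the caption only says sqrt(n).
    """
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    k = math.ceil(math.sqrt(len(err)))
    return np.argsort(-err)[:k]


def sorted_rollout_features(rollouts):
    """Sort each input's K rollouts ascending, so the aggregator's input
    does not depend on the order in which rollouts were sampled."""
    return np.sort(np.asarray(rollouts, dtype=float), axis=1)


def combined_loss(y_true, y_pred, lam=1.0):
    """Hypothetical combined objective: (1 - CCC) + lam * NMSE."""
    return (1.0 - ccc(y_true, y_pred)) + lam * nmse(y_true, y_pred)


if __name__ == "__main__":
    y = np.array([1.0, 2.0, 3.0, 4.0])
    mean_pred = np.full_like(y, y.mean())
    # A model that collapses to the mean has zero concordance and NMSE of 1.
    print(ccc(y, mean_pred), nmse(y, mean_pred))
    print(worst_examples([0, 0, 0, 0], [0.1, 2.0, 0.5, 1.0]))
    print(sorted_rollout_features([[3.0, 1.0, 2.0]]))
```

Under these definitions, a predictor that always outputs the ground-truth mean gets a CCC of 0 and an NMSE of 1, which is a useful reference point when reading the mean-collapse behavior described in Figure 1.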