LLM REgression with a Latent Iterative State Head

Yiheng Su, Matthew Lease

Abstract

We present RELISH (REgression with a Latent Iterative State Head), a novel, lightweight architecture designed for text regression with large language models. Rather than decoding numeric targets as text or aggregating multiple generated outputs, RELISH predicts scalar values directly from frozen LLM representations by iteratively refining a learned latent state through cross-attention over token-level representations, and then mapping the final state to a point estimate with a linear regressor. Across five datasets, four LLM backbones, and two LLM training regimes, RELISH consistently outperforms prior baselines from all three major LLM regression families, including autoregressive decoding, regression-aware inference, and existing predictive head methods. Despite these gains, RELISH remains highly parameter-efficient, requiring only 3.4-3.7M trainable parameters across frozen LLM backbones (only 0.01-0.04% additional overhead), far less than LoRA-based alternatives that grow with model size (0.26-0.42%).
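
To make the architecture concrete, below is a minimal PyTorch sketch of a RELISH-style head, assuming a frozen backbone that exposes token-level hidden states. The class name, layer sizes, normalization, and initialization are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
from typing import Optional

class LatentIterativeStateHead(nn.Module):
    """Sketch of a latent iterative state head: a learned latent state is
    refined by cross-attending over frozen token-level representations,
    then a linear regressor maps the final state to a scalar prediction."""

    def __init__(self, hidden_dim: int, latent_dim: int = 512,
                 num_heads: int = 8, depth: int = 3):
        super().__init__()
        self.depth = depth                                          # number of refinement steps (L)
        self.latent = nn.Parameter(torch.randn(1, 1, latent_dim))   # learned initial latent state
        self.proj = nn.Linear(hidden_dim, latent_dim)                # project backbone token states
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)
        self.regressor = nn.Linear(latent_dim, 1)                    # final point estimate

    def forward(self, token_states: torch.Tensor,
                padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim) from a frozen LLM
        kv = self.proj(token_states)
        state = self.latent.expand(token_states.size(0), -1, -1)
        for _ in range(self.depth):
            # Iteratively refine the latent state via cross-attention over tokens,
            # with a residual update and layer normalization.
            attended, _ = self.cross_attn(state, kv, kv, key_padding_mask=padding_mask)
            state = self.norm(state + attended)
        return self.regressor(state.squeeze(1)).squeeze(-1)          # (batch,) scalar predictions

Because the backbone stays frozen and only this head is trained, the trainable-parameter count is nearly independent of backbone size, consistent with the 3.4-3.7M figure reported above.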

Paper Structure

This paper contains 80 sections, 23 equations, 5 figures, and 18 tables.

Figures (5)

  • Figure 1: Detailed results across datasets (columns) and LLMs (rows). The rightmost column averages over all datasets, and the bottom row averages over LLMs. Bars show absolute improvement in Pearson ($r$) correlation over the zero-shot baseline (the "ground zero" line). Error bars denote variability across 3 independently-seeded runs. All metrics are computed using gold test labels. RELISH consistently performs best across all settings.
  • Figure 2: Spearman ($\rho$) correlation across datasets (columns) and LLMs (rows). The rightmost column averages over all datasets, and the bottom row averages over LLMs. Whereas Figure 1 showed Pearson correlation, this figure shows absolute improvement in Spearman correlation over the zero-shot baseline (the "ground zero" line). Error bars denote variability across 3 independently-seeded runs. All metrics are computed using gold test labels. RELISH consistently performs best across all settings.
  • Figure 3: NRMSE results across datasets (columns) and LLMs (rows). The rightmost column averages over all datasets, and the bottom row averages over LLMs. Whereas Figure 1 showed Pearson correlation, this figure shows absolute improvement in range-normalized root mean squared error (NRMSE) over the zero-shot baseline (the "ground zero" line). Error bars denote variability across 3 independently-seeded runs. All metrics are computed using gold test labels. RELISH consistently performs best across all settings.
  • Figure 4: Loss ablations. We compare LoRA + MLP, RAFT, and two RELISH variants trained with MSE or Huber loss across five datasets (columns) and four LLM backbones (rows). Metrics are NRMSE (lower is better), Pearson correlation (higher is better), and Spearman correlation (higher is better). Error bars show variability across 3 seeds. Overall, MSE and Huber perform similarly, but Huber is modestly better in most settings.
  • Figure 5: Refinement depth ablations. We systematically vary the refinement depth of RELISH from 1 to 5. We report range-normalized root mean square error (NRMSE; lower is better), Pearson correlation, and Spearman correlation (higher is better), macro-averaged across five datasets for four LLM backbones. Error bars denote the standard deviation across three random seeds. The $L=1$ configuration isolates the performance of a single-step, attention-based pooling baseline. For most LLM backbones, multiple refinement iterations ($L \ge 2$) substantially enhance predictive performance, with diminishing returns past $L=3$ (our default setting, shaded in grey). Notably, the optimal depth varies by LLM architecture: Qwen 3 8B achieves peak performance at $L=1$, while the larger Gemma 3 27B continues to benefit from refinement even at higher depths. (A minimal configuration sketch for these ablations follows this list.)
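
As a companion to the last two captions, the snippet below sketches how the two ablated settings could be configured in training code, reusing the LatentIterativeStateHead class from the sketch above: swapping the regression loss between MSE and Huber (Figure 4) and varying the refinement depth $L$ (Figure 5). The function name and the hidden-size value are illustrative assumptions, not the paper's training code.

import torch
import torch.nn as nn

def regression_loss(preds: torch.Tensor, targets: torch.Tensor,
                    kind: str = "huber") -> torch.Tensor:
    # Figure 4 compares training the head with MSE versus Huber loss.
    loss_fn = nn.HuberLoss() if kind == "huber" else nn.MSELoss()
    return loss_fn(preds, targets)

# Figure 5 sweeps the refinement depth L from 1 to 5; L=1 reduces the head to
# single-step attention pooling, and L=3 is the paper's default.
for depth in range(1, 6):
    head = LatentIterativeStateHead(hidden_dim=4096, depth=depth)  # hidden_dim is illustrative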