Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning

Ming Chen, Sheng Tang, Rong-Xi Tan, Ziniu Li, Jiacheng Chen, Ke Xue, Chao Qian

TL;DR

The paper tackles the mismatch between token-level supervision and continuous targets in decoding-based regression. It introduces GenRe$^2$, an RL-based framework that treats decoding as a Markov Decision Process (MDP) and optimizes sequence-level rewards to enforce global numerical coherence, yielding consistent improvements over token-level baselines on both tabular and code-metric regression. Through extensive experiments and ablations, the authors show that sequence-level feedback improves predictive precision and sampling efficiency, supporting decoding-based regression as a robust approach to numerical prediction. The work also analyzes RL training dynamics, the shaping of the output distribution, and the stability of GRPO versus ReMax, offering guidance for future RL-enabled regression methods.

Abstract

Decoding-based regression, which reformulates regression as a sequence generation task, has emerged as a promising paradigm for applying large language models to numerical prediction. However, its progress is hindered by the misalignment between discrete token-level objectives (e.g., cross-entropy) and continuous numerical values. Existing approaches relying on token-level constraints often fail to capture the global magnitude of the target value, limiting their precision and generalization. In this paper, we propose to unlock the potential of decoding-based regression via Reinforcement Learning (RL). We formulate the generation process as a Markov Decision Process, utilizing sequence-level rewards to enforce global numerical coherence. Extensive experiments on tabular regression and code metric regression demonstrate that our method (specifically with ReMax and GRPO) consistently outperforms both state-of-the-art token-level baselines and traditional regression heads, showing the superiority of introducing sequence-level signals. Our analysis further reveals that RL significantly enhances sampling efficiency and predictive precision, establishing decoding-based regression as a robust and accurate paradigm for general-purpose numerical prediction.
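
To make the idea of a sequence-level signal concrete, the following is a minimal sketch, not the authors' implementation. It assumes the reward for a sampled sequence is the negative squared error of its detokenized value, and it uses the standard forms of the two estimators named in the abstract: GRPO normalizes rewards within a group of samples, and ReMax subtracts the reward of the greedy decode as a baseline. The function names, the `eps` constant, and the toy numbers are illustrative assumptions; the paper may scale or normalize rewards differently.

```python
from statistics import mean, pstdev


def sequence_reward(y_hat, y):
    """Assumed sequence-level reward: negative squared error of one decoded value."""
    return -((y_hat - y) ** 2)


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def remax_advantages(rewards, greedy_reward):
    """ReMax-style advantages: subtract the reward of the greedy decode."""
    return [r - greedy_reward for r in rewards]


# Toy group: three sampled predictions for a target of 6 (hypothetical values).
target = 6
sampled_predictions = [6, 5, 7]
rewards = [sequence_reward(p, target) for p in sampled_predictions]

grpo = grpo_advantages(rewards)
# Assume the greedy decode produced 5, so its reward serves as the baseline.
remax = remax_advantages(rewards, sequence_reward(5, target))
```

In both cases every token of a sampled sequence shares that sequence's advantage, so the update depends on the decoded value as a whole rather than on per-token agreement with the label.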

Paper Structure

This paper contains 43 sections, 19 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Illustration of decoding-based regression. The input $\mathbf{x}$ passes through an encoder to produce the representation $\phi(\mathbf{x})$, which is then processed by a decoder. The model performs multiple sampling trials to generate several discrete token sequences (e.g., the binary representation <1><1><0>). These sequences are individually detokenized into corresponding scalar values (shown in the stacked layers as $\hat{y}_1=6, \hat{y}_2=5, \hat{y}_3=7$). Finally, these scalar values are combined via an aggregation strategy (e.g., median) to produce the final prediction $\hat{y}=6$.
  • Figure 2: Comparison between local token-level training and global sequence-level update. Left (existing methods): The model is trained at each token $[t_1, \dots, t_n]$ with a local loss (e.g., CE) that focuses solely on individual tokens. Right (ours): The model generates a full sequence and detokenizes it into a prediction $\hat{y}$. A global reward (i.e., negative MSE) against the ground truth $y$ is then backpropagated to update the model parameters.
  • Figure 3: Training dynamics of GenRe$^2$. Top row: Normalized reward dynamics for GenRe$^2$ combined with ReMax (left) and GRPO (right) on 100 TALENT regression tasks, where the reward is normalized to $[0,1]$ with respect to each task. Bottom row: Visualization of regression performance dynamics on Kaggle_bike_sharing_demand_challange, comparing GenRe$^2$ with NTL-WAS and DIST$^2$ on test $R^2$ score (left, higher is better) and test Wasserstein-1 distance (right, lower is better).
  • Figure 4: Average $R^2$ over 100 TALENT regression tasks of different methods under varying normalized tokenization digit bases.
  • Figure 5: Metric dynamics across 100 TALENT regression tasks. The left sub-figure displays the average best $R^2$@$k$, while the right one shows the average mean (dashed) and median (solid) $R^2$@$k$.
  • ...and 8 more figures
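
The inference pipeline described in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: it assumes digit tokens are ordered most-significant first and encode a plain non-negative integer in the given base, which matches the caption's binary example but ignores any sign, exponent, or normalization the actual tokenizer may use. The names `detokenize` and `aggregate` are hypothetical.

```python
from statistics import median


def detokenize(tokens, base=2):
    """Turn digit tokens (most significant first) into an integer value."""
    value = 0
    for digit in tokens:
        if not 0 <= digit < base:
            raise ValueError(f"digit {digit} is out of range for base {base}")
        value = value * base + digit
    return value


def aggregate(samples, base=2):
    """Detokenize each sampled sequence, then combine the values with the median."""
    return median(detokenize(s, base) for s in samples)


# The three sampled sequences from the caption: <1><1><0>, <1><0><1>, <1><1><1>.
samples = [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
values = [detokenize(s) for s in samples]  # [6, 5, 7]
prediction = aggregate(samples)            # 6
```

The median is only one possible aggregation strategy; the caption presents it as an example, and a mean or another robust statistic could be substituted in `aggregate`.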