Predicting Through Generation: Why Generation Is Better for Prediction

Md Kowsher, Nusrat Jahan Prottasha, Prakash Bhat, Chun-Nam Yu, Mojtaba Soltanalian, Ivan Garibay, Ozlem Garibay, Chen Chen, Niloofar Yousefi

TL;DR

This work argues that token-level generation yields richer, more task-relevant information for prediction than pooling-based classifiers, supported by the Data Processing Inequality. It introduces PredGen, an end-to-end framework that uses scheduled sampling to mitigate exposure bias and a task adapter to convert generated tokens into structured outputs, complemented by Writer-Director Alignment Loss (WDAL) to align generation with final predictions. Theoretical results and empirical mutual-information estimates show generation preserves more information than pooling, and extensive experiments across classification, regression, and arithmetic reasoning demonstrate consistent gains over traditional baselines. The approach enables robust, numerically precise predictions from large language models while addressing formatting and coherence through WDAL and the task adapter. Together, these contributions extend token-level generation to structured prediction with practical improvements in accuracy and numerical fidelity across diverse benchmarks.

Abstract

This paper argues that generating output tokens is more effective than using pooled representations for prediction tasks because token-level generation retains more mutual information. Since LLMs are trained on massive text corpora using next-token prediction, generation aligns naturally with their learned behavior. Using the Data Processing Inequality (DPI), we provide both theoretical and empirical evidence supporting this claim. However, autoregressive models face two key challenges when used for prediction: (1) exposure bias, where the model sees ground-truth tokens during training but relies on its own predictions during inference, leading to errors, and (2) format mismatch, where discrete tokens do not always align with the task's required output structure. To address these challenges, we introduce PredGen (Predicting Through Generating), an end-to-end framework that (i) uses scheduled sampling to reduce exposure bias, and (ii) introduces a task adapter to convert the generated tokens into structured outputs. Additionally, we introduce Writer-Director Alignment Loss (WDAL), which ensures consistency between token generation and final task predictions, improving both text coherence and numerical accuracy. We evaluate PredGen on multiple classification and regression benchmarks. Our results show that PredGen consistently outperforms standard baselines, demonstrating its effectiveness in structured prediction tasks.
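
The information-theoretic claim in the abstract can be illustrated with a small numeric sketch. The toy joint distribution, the `pool` function, and the helper names below are hypothetical and chosen only for illustration; they are not taken from the paper. The sketch computes mutual information for a discrete label X and a two-component representation Z, then again after a deterministic "pooling" step that discards one component.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """Mutual information in bits for a joint distribution {(a, b): prob}."""
    pa = defaultdict(float)
    pb = defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def push_forward(joint, g):
    """Joint distribution of (X, g(Z)) given the joint distribution of (X, Z)."""
    out = defaultdict(float)
    for (x, z), p in joint.items():
        out[(x, g(z))] += p
    return dict(out)


# Hypothetical toy data: X is a binary label, Z is a pair of binary features.
joint_xz = {
    (0, (0, 0)): 0.30,
    (0, (0, 1)): 0.10,
    (0, (1, 0)): 0.10,
    (1, (1, 1)): 0.30,
    (1, (0, 1)): 0.10,
    (1, (1, 0)): 0.10,
}


def pool(z):
    # Hypothetical deterministic "pooling": keep only the first component of Z.
    return z[0]


mi_full = mutual_information(joint_xz)
mi_pooled = mutual_information(push_forward(joint_xz, pool))
print(f"I(X; Z)    = {mi_full:.3f} bits")
print(f"I(X; g(Z)) = {mi_pooled:.3f} bits")
```

In this toy case the pooled representation carries strictly less information about X than the full one. The Data Processing Inequality guarantees only that pooling can never carry more; how large the gap is on real models is an empirical question that the paper addresses with its own estimates.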

Paper Structure

This paper contains 50 sections, 3 theorems, 85 equations, 13 figures, 6 tables.

Key Result

Theorem 1

Let $\mathbf{X}$ be an input random variable, and let $\mathbf{Z} \in \mathcal{Z}$ be the final hidden representation produced by a model (e.g., an LLM). Suppose $\mathbf{Z_p} = g( \mathbf{Z})$ for some deterministic function $g : \mathcal{Z} \to \mathcal{W}$ (e.g., first-token pooling or mean pooli
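
In its standard textbook form, the Data Processing Inequality that the abstract invokes can be written for this setup as follows. This is a general fact about deterministic post-processing of a representation, offered as context rather than as the exact wording of Theorem 1:

$$
\mathbf{X} \to \mathbf{Z} \to \mathbf{Z_p} = g(\mathbf{Z})
\quad\Longrightarrow\quad
I(\mathbf{X}; \mathbf{Z_p}) \le I(\mathbf{X}; \mathbf{Z}).
$$

Equality holds when $g$ discards nothing about $\mathbf{X}$, for example when $g$ is invertible on the support of $\mathbf{Z}$.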

Figures (13)

  • Figure 1: Comparison of different prediction methods using a language model. (Left) The traditional approach where a pooled representation $\mathbf{Z_p}$ is passed to a classifier for prediction. (Middle) A similar method where $\mathbf{Z_p}$ is extracted from the hidden states and used for classification. (Right) The generative approach, where the model generates additional tokens $\mathbf{Y_1}, \mathbf{Y_2}, \ldots, \mathbf{Y_{m}}$, and their hidden states are processed by a task adapter for prediction. This method retains more task-relevant information by using token-level generation.
  • Figure 2: Comparison of mutual information estimates for Predictor, Generator, and PredGen across multiple datasets. PredGen consistently retains higher mutual information, supporting the theoretical claim that token-level generation preserves richer task-relevant information than pooled representations.
  • Figure 3: Token-wise mutual information on SST-2 (Socher et al., 2013). The predicted token "positive" shows high MI with sentiment-related tokens like "funny" (0.47) and "pretty" (0.34), highlighting strong contextual dependencies.
  • Figure 4: Effect of max_steps_for_sampling on performance. A gradual transition (max_steps_for_sampling = 1000) achieves the best performance, balancing reference-based and self-generated predictions.
  • Figure 5: MSE loss comparison between Sequence-Level and Token-Level Sampling across datasets.
  • ...and 8 more figures
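
The captions for Figures 4 and 5 refer to a `max_steps_for_sampling` setting and to token-level sampling. A minimal sketch of how such a schedule might work is given below. The linear ramp, the function names, and the example tokens are assumptions made for illustration; the captions do not specify the shape of the schedule or the implementation.

```python
import random


def self_sampling_prob(step, max_steps_for_sampling):
    """Probability of feeding the model's own prediction at a training step.

    Assumption: a linear ramp from 0 to 1 over max_steps_for_sampling steps.
    """
    if max_steps_for_sampling <= 0:
        return 1.0
    return min(1.0, max(0.0, step / max_steps_for_sampling))


def mix_inputs(reference_tokens, predicted_tokens, step, max_steps_for_sampling, rng):
    """Token-level mixing: pick each input token from the reference or the prediction."""
    p = self_sampling_prob(step, max_steps_for_sampling)
    return [
        pred if rng.random() < p else ref
        for ref, pred in zip(reference_tokens, predicted_tokens)
    ]


rng = random.Random(0)
reference = ["4", "2", ".", "5"]   # hypothetical ground-truth tokens
predicted = ["4", "3", ".", "5"]   # hypothetical model predictions

# At step 0 every input comes from the reference (pure teacher forcing).
early = mix_inputs(reference, predicted, step=0, max_steps_for_sampling=1000, rng=rng)
# Once the step reaches max_steps_for_sampling, every input is self-generated.
late = mix_inputs(reference, predicted, step=1000, max_steps_for_sampling=1000, rng=rng)
```

Under this assumed schedule, a larger `max_steps_for_sampling` gives a slower transition from reference tokens to self-generated ones. A sequence-level variant would draw one random number per sequence instead of one per token.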

Theorems & Definitions (5)

  • Theorem 1
  • Theorem 2
  • proof
  • Theorem 3
  • proof