Lightweight Latent Reasoning for Narrative Tasks

Alexander Gurung, Nikolay Malkin, Mirella Lapata

TL;DR

LiteReason introduces a lightweight Reasoning Projector that enables latent reasoning interleaved with discrete token sampling in narrative tasks. By training the projector with supervised fine-tuning and combining it with reinforcement learning, the method comes close to non-latent RL performance while reducing final reasoning length by 77-92%. Evaluations on Flawed Fictions and Next-Chapter Prediction show substantial token savings during training and inference, with LiteReason outperforming existing latent-reasoning baselines. The approach offers a practical, scalable path to efficient long-context reasoning in narrative settings and invites further exploration of latent architectures and prompting strategies.

Abstract

Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model 'skip' reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.
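The interleaving of latent and discrete reasoning described above can be pictured with a toy decoding loop. The sketch below is an illustration only, not the authors' implementation: `step_fn`, `lm_head`, `projector`, and `embed` are hypothetical stand-ins for the transformer forward pass, the LM head, the Reasoning Projector, and the embedding table; the `<eos>` token and the fixed latent budget `n_latent` are assumptions, since the abstract does not say how the number of latent steps is chosen. The `<bot>` tag name follows the Figure 1 caption.

```python
BOT = "<bot>"   # implicit-thought tag that switches to latent mode
EOS = "<eos>"   # hypothetical end-of-sequence token


def interleaved_decode(step_fn, lm_head, projector, embed,
                       prompt_embs, n_latent=3, max_steps=32):
    """Alternate discrete sampling with latent steps.

    Returns the emitted discrete tokens and the number of latent steps.
    """
    inputs = list(prompt_embs)   # input embeddings fed to the model so far
    tokens = []                  # discrete tokens actually generated
    latent_left = 0              # latent forward passes still to run
    latent_steps = 0
    for _ in range(max_steps):
        hidden = step_fn(inputs)             # one forward pass
        if latent_left > 0:
            # Latent mode: feed a continuous embedding back, emit no token.
            inputs.append(projector(hidden))
            latent_left -= 1
            latent_steps += 1
            continue
        token = lm_head(hidden)              # discrete mode: pick a token
        tokens.append(token)
        if token == EOS:
            break
        inputs.append(embed(token))
        if token == BOT:
            latent_left = n_latent           # the model chose to go latent
    return tokens, latent_steps


# Toy stand-ins so the loop runs end to end (all hypothetical).
SCRIPT = {1: "think", 2: BOT, 6: "answer", 7: EOS}


def toy_step(inputs):
    return len(inputs)            # "hidden state" = current position


def toy_lm_head(hidden):
    return SCRIPT[hidden]         # scripted discrete choice


def toy_projector(hidden):
    return 0.1 * hidden           # pretend continuous embedding


def toy_embed(token):
    return float(len(token))      # pretend embedding lookup


tokens, latent_steps = interleaved_decode(
    toy_step, toy_lm_head, toy_projector, toy_embed, prompt_embs=[0.0])
print(tokens, latent_steps)
```

In this toy run, three positions are filled by projector outputs rather than sampled tokens, so only four discrete tokens are emitted across seven forward passes.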

Paper Structure

This paper contains 34 sections, 5 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: High-level diagram of LiteReason. Discrete sampling (via the LM head) is performed as normal, selecting a token (e.g., $x_1$) and passing its corresponding discrete token embedding, until we encounter special 'implicit-thought' tags represented here by "<bot>". We then switch to latent reasoning mode and use the Reasoning Projector to directly predict continuous token embeddings (e.g., $e_0$) for a number of forward passes before switching back to discrete sampling. We can switch between discrete and reasoning mode multiple times before producing the final answer with discrete sampling. When training the Reasoning Projector we 1) randomly replace reasoning steps with the implicit thought tags and 2) freeze the rest of the LLM and apply a cross-entropy loss on the remaining reasoning steps (discrete tokens).
  • Figure 2: Bradley-Terry relative strength on the NCP task. Default refers to the untrained Qwen2.5-7B model. RL-trained is the upper bound trained with non-latent RL. We find that relative strengths largely follow the trend of contrastive improvement in Table \ref{tab:combined_method_comparisons}: RL training performs best, followed by LiteReason, with a gap separating these two from the remaining methods.
  • Figure 3: Generated tokens vs. accuracy for the Flawed Fictions benchmark, by model type. The goal is to be further right (more accurate) and lower (fewer tokens). We find that LiteReason is significantly more efficient than the standard RL-Trained model, with only slightly worse performance. Aside from RL-Trained, there is a significant gap in both accuracy (about 20%) and efficiency (about 100 tokens) between LiteReason and the next best method. Thus, we claim our model sits on the same Pareto frontier as the RL-Trained baseline.
  • Figure 4: Generated tokens vs. Contrastive Improvement for the Next-Chapter Prediction task, by model type. The goal is to be further right (higher Contrastive Improvement) and lower (fewer tokens). We find that LiteReason is significantly more efficient than the standard RL-Trained model, with only slightly worse performance. Aside from RL-Trained, there is a significant gap in both Contrastive Improvement and efficiency between LiteReason and the next best method. Thus, we claim our model sits on the same Pareto frontier as the RL-Trained baseline.
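
The Pareto-frontier claim in the Figure 3 and Figure 4 captions can be made concrete with a small dominance check. The following is a minimal sketch under stated assumptions: the function name `pareto_front`, the method labels, and every number are hypothetical and chosen only for illustration; none of them are results from the paper. A method is on the frontier if no other method is at least as good on both axes (fewer generated tokens, higher score) and strictly better on one.

```python
def pareto_front(points):
    """Return names of points not dominated by any other point.

    Each point is (name, tokens, score): fewer tokens is better,
    a higher score is better.
    """
    front = []
    for name, tokens, score in points:
        dominated = any(
            t <= tokens and s >= score and (t < tokens or s > score)
            for _, t, s in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical numbers for illustration only; NOT results from the paper.
methods = [
    ("A", 900, 0.70),   # accurate but token-hungry
    ("B", 150, 0.66),   # slightly less accurate, far fewer tokens
    ("C", 250, 0.45),   # worse than B on both axes
    ("D", 400, 0.40),   # worse than B on both axes
]
print(pareto_front(methods))
```

Here A and B are both on the frontier because neither beats the other on both axes, while C and D are dominated by B. This is the sense in which two methods with different token budgets can sit on the same frontier.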