
Exploring System 1 and 2 communication for latent reasoning in LLMs

Julian Coda-Forno, Zhuokai Zhao, Qiang Zhang, Dipesh Tamboli, Weiwei Li, Xiangjun Fan, Lizhu Zhang, Eric Schulz, Hsiao-Ping Tseng

TL;DR

The paper investigates whether latent reasoning in LLMs should reside in a separate Coprocessor or within a single model's forward pass. It evaluates two communication-focused variants of a KV-cache Coprocessor and compares them to a unified soft-embedding baseline under matched latent budgets, across reasoning and pretraining tasks. Co-finetuning yields the strongest gains among the dual designs, but a parameter-matched single model with soft latent prompts often matches or surpasses the dual setups, which suggests that the dual designs mostly add compute rather than qualitatively improve reasoning. Explicit latent-space objectives, such as orthogonality regularization, can restore specialized latent roles and improve combinatorial reasoning, though they may cost general language modeling performance. This points to curriculum and objective design as directions for future work on System-2-like latent reasoning.

Abstract

Should LLM reasoning live in a separate module, or within a single model's forward pass and representational space? We study dual-architecture latent reasoning, where a fluent Base exchanges latent messages with a Coprocessor, and test two hypotheses aimed at improving latent communication over Liu et al. (2024): (H1) increase channel capacity; (H2) learn communication via joint finetuning. Under matched latent-token budgets on GPT-2 and Qwen-3, H2 is consistently strongest while H1 yields modest gains. A unified soft-embedding baseline, a single model with the same forward pass and shared representations, using the same latent-token budget, nearly matches H2 and surpasses H1, suggesting current dual designs mostly add compute rather than qualitatively improving reasoning. Across GSM8K, ProsQA, and a Countdown stress test with increasing branching factor, scaling the latent-token budget beyond small values fails to improve robustness. Latent analyses show overlapping subspaces with limited specialization, consistent with weak reasoning gains. We conclude dual-model latent reasoning remains promising in principle, but likely requires objectives and training schedules that explicitly shape latent spaces for algorithmic planning.

Paper Structure

This paper contains 32 sections, 11 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Overview of the deliberation-in-KV-cache architecture of Liu et al. (2024) and our two variants designed to strengthen cross-module communication.
  • Figure 2: Validation perplexity during training on the FineWeb-Edu-100BT corpus. A: GPT-2 variants (using $N_L{=}16$ where applicable). B: Qwen-3 variants (using $N_L{=}16$ where applicable). C: Ablating the number of latents $N_L$ of the GPT-2 Coprocessor for Hypothesis 2. D: Ablating the number of latents $N_L$ of the Qwen-3 Coprocessor for Hypothesis 2.
  • Figure 3: Ablating the latent budget. A: GSM8K accuracy vs. total latents $N_L$ (GPT-2 and Qwen; Coconut shown for GPT-2). B: Countdown accuracy vs. operands (3, 4, 5) with lines for $N_L\!\in\!\{1,4,8, 16\}$, merged across model families.
  • Figure 4: Latent cross-subspace capture heatmaps (last Coprocessor layer). Each cell $(i,j)$ shows the fraction of variance of latent $j$ captured by the principal subspace of latent $i$. A: Large-scale training. B: Finetuning on GSM8K with curriculum. C: Finetuning on Countdown with operands = 4.
  • Figure 5: Explicit latent regularization results. A: In large-scale training, validation perplexity (lower is better) worsens as the regularization strength $\lambda$ increases. B: Under strong regularization ($\lambda=3$), Countdown (operands=4) accuracy again scales monotonically with $N_L$. C: Latent cross-capture heatmap for Countdown ($\lambda=3$) shows distinct, non-overlapping subspaces (compare to Figure 4).
  • ...and 11 more figures
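
The cross-subspace capture described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' code: the subspace rank `k`, the mean-centering, the SVD-based PCA, and the function names `principal_subspace` and `cross_capture` are all assumptions made for illustration, and the synthetic latents are invented for the demo. Cell $(i,j)$ is computed as the share of the variance of latent $j$ that survives projection onto the top-`k` principal directions of latent $i$.

```python
import numpy as np


def principal_subspace(x, k):
    """Return the top-k principal directions of x (n_samples, d) as a (d, k) matrix."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def cross_capture(latents, k):
    """Cross-subspace capture matrix for latents of shape (n_latents, n_samples, d).

    capture[i, j] is the fraction of the variance of latent j that lies in the
    top-k principal subspace of latent i (k is an assumed hyperparameter).
    """
    n_latents = latents.shape[0]
    bases = [principal_subspace(latents[i], k) for i in range(n_latents)]
    capture = np.zeros((n_latents, n_latents))
    for j in range(n_latents):
        centered = latents[j] - latents[j].mean(axis=0, keepdims=True)
        total = np.sum(centered ** 2)
        for i in range(n_latents):
            projected = centered @ bases[i]  # coordinates inside subspace i
            capture[i, j] = np.sum(projected ** 2) / total
    return capture


# Synthetic demo (hypothetical data): latents 0 and 2 share coordinates 0-1,
# latent 1 lives in coordinates 2-3 only.
rng = np.random.default_rng(0)
n_samples, dim, rank = 200, 8, 2
a = np.zeros((n_samples, dim))
b = np.zeros((n_samples, dim))
c = np.zeros((n_samples, dim))
a[:, :2] = rng.normal(size=(n_samples, 2))
b[:, 2:4] = rng.normal(size=(n_samples, 2))
c[:, :2] = rng.normal(size=(n_samples, 2))

capture = cross_capture(np.stack([a, b, c]), k=rank)
print(np.round(capture, 3))
```

By construction, the overlapping pair (latents 0 and 2) captures essentially all of each other's variance, while latent 1 shares essentially none with either. In this toy setting, a matrix with large off-diagonal entries corresponds to the "overlapping subspaces" reading, and a near-diagonal matrix to the "distinct, non-overlapping subspaces" reading.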