Beyond Words: A Latent Memory Approach to Internal Reasoning in LLMs

José I. Orlicki

TL;DR

The paper addresses the inefficiency of explicit chain-of-thought reasoning in LLMs by proposing an Implicit Memory Module (IMM) that stores and retrieves latent representations within a differentiable memory bank $M \in \mathbb{R}^{N \times d}$. It formalizes the IMM with write and read operations, using $s_t = f_{\text{write}}(h_t)$, $q_t = f_{\text{query}}(h_t)$, and retrieval $r_t = \sum_i \alpha_i M[i]$ where $\alpha = \text{softmax}(M q_t^\top)$, integrating retrieved information into the current hidden state. An optional lightweight explicit CoT decoder can be attached for auditability without burdening core reasoning. Empirical results on Shakespeare with nanoGPT show substantial final loss reductions across context windows (≈35–58%), achieved with Linformer-style compression that keeps overhead modest and scales with embedding size. The framework offers a practical, scalable path toward efficient internal reasoning in larger LLMs, with future work exploring memory-slot ablations, adaptive long-term memory, and safer explicit oversight mechanisms.
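The write and read operations summarized above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's implementation: the summary gives the formulas for $s_t$, $q_t$, $\alpha$ and $r_t$ but not the parameterization, so the linear maps standing in for $f_{\text{write}}$ and $f_{\text{query}}$, the circular-buffer write policy, the residual `tanh` integration step, and the class name `ImplicitMemorySketch` are all assumptions.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


class ImplicitMemorySketch:
    """Hypothetical IMM: a memory bank M of shape (N, d) with write and read steps."""

    def __init__(self, n_slots, d, seed=0):
        rng = np.random.default_rng(seed)
        scale = d ** -0.5
        self.M = np.zeros((n_slots, d))                      # memory bank M in R^{N x d}
        self.W_write = rng.normal(scale=scale, size=(d, d))  # assumed linear f_write
        self.W_query = rng.normal(scale=scale, size=(d, d))  # assumed linear f_query
        self.W_out = rng.normal(scale=scale, size=(d, d))    # assumed linear part of g
        self.ptr = 0                                         # next slot to overwrite

    def write(self, h_t):
        # s_t = f_write(h_t), stored in the next slot (circular buffer is an assumption)
        s_t = h_t @ self.W_write
        self.M[self.ptr] = s_t
        self.ptr = (self.ptr + 1) % self.M.shape[0]
        return s_t

    def read(self, h_t):
        # q_t = f_query(h_t); alpha = softmax(M q_t^T); r_t = sum_i alpha_i M[i]
        q_t = h_t @ self.W_query
        alpha = softmax(self.M @ q_t)
        r_t = alpha @ self.M
        return r_t, alpha

    def step(self, h_t):
        # Write, then read, then fold r_t back into h_t (residual form is an assumption)
        self.write(h_t)
        r_t, alpha = self.read(h_t)
        h_tilde = h_t + np.tanh(r_t @ self.W_out)
        return h_tilde, alpha


if __name__ == "__main__":
    imm = ImplicitMemorySketch(n_slots=4, d=8)
    h_tilde, alpha = imm.step(np.ones(8))
    print(h_tilde.shape, alpha.shape, alpha.sum())
```

The attention weights `alpha` form a distribution over the $N$ slots, so `r_t` is a convex combination of stored summaries. Each read costs on the order of $N \cdot d$ operations per token, regardless of how the write policy is chosen.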

Abstract

Recent advances in large language models (LLMs) have popularized the chain-of-thought (CoT) paradigm, in which models produce explicit reasoning steps in natural language. Although this approach improves interpretability and facilitates external auditing, it may not represent the most computationally efficient method for internal reasoning. In contrast, human cognition relies on implicit mental representations that recall past sensory and episodic information without requiring complete verbalization. In this paper, we propose a framework that integrates implicit mental representations into the internal reasoning processes of LLMs. Preliminary experiments indicate that incorporating an Implicit Memory Module (IMM) into a simple GPT model yields a reduction of between 35% and 57% in final training loss compared to a regular GPT baseline. The addition of an explicit interpretability channel (e.g., a chain-of-thought decoder) is straightforward to implement within this approach. We outline theoretical foundations, propose technical mechanisms to scale the memory module, and discuss how these ideas may lead to more efficient and robust reasoning, with optional future extensions for explicit auditability.

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures.

Figures (5)

  • Figure 1: Model architecture showing the Transformer layers with the Implicit Memory Module (IMM) and an optional explicit decoder for future interpretability enhancements.
  • Figure 2: Zoom-in of the Transformer + IMM submodule. The input hidden state $h_t$ is processed by a write function $f_{\text{write}}$ to produce a summary $s_t$ that is stored in the Memory Bank $M$. Simultaneously, $h_t$ is used to compute a query $q_t$, which retrieves relevant memory $r_t$ through attention. The retrieved memory is transformed by $g(\cdot)$ and integrated with $h_t$ to produce the updated hidden state $\tilde{h}_t$.
  • Figure 3: Loss comparison for nanoGPT models (similar to GPT-2) trained on the Shakespeare dataset with the following default parameters: block_size=64, batch_size=12, n_layer=4, n_head=4, n_embd=128, max_iters=2000, lr_decay_iters=2000, dropout=0.0. Tokens per iteration: 768 (block_size × batch_size).
  • Figure 4: Loss comparison for nanoGPT models (similar to GPT-2) trained on the Shakespeare dataset with the following default parameters: block_size=128, batch_size=12, n_layer=4, n_head=4, n_embd=256, max_iters=2000, lr_decay_iters=2000, dropout=0.0. Tokens per iteration: 1536 (block_size × batch_size).
  • Figure 5: Loss comparison for nanoGPT models (similar to GPT-2) trained on the Shakespeare dataset with the following default parameters: block_size=256, batch_size=12, n_layer=4, n_head=4, n_embd=512, max_iters=2000, lr_decay_iters=2000, dropout=0.0. Tokens per iteration: 3072 (block_size × batch_size).
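
The tokens-per-iteration values in the captions of Figures 3–5 can be checked with a short script. The settings below are copied from the captions; the helper `tokens_per_iter` and the `RUNS` dictionary are illustrative names, and the defaults of one gradient-accumulation step and a single process are assumptions, chosen because they are what the reported values imply.

```python
# Settings copied from the captions of Figures 3-5; all names here are illustrative.
RUNS = {
    "Figure 3": {"block_size": 64, "batch_size": 12, "n_embd": 128},
    "Figure 4": {"block_size": 128, "batch_size": 12, "n_embd": 256},
    "Figure 5": {"block_size": 256, "batch_size": 12, "n_embd": 512},
}
MAX_ITERS = 2000  # max_iters from the captions


def tokens_per_iter(block_size, batch_size, grad_accum_steps=1, world_size=1):
    """Tokens seen per optimizer step (accumulation and world size are assumed to be 1)."""
    return grad_accum_steps * world_size * batch_size * block_size


for name, cfg in RUNS.items():
    per_iter = tokens_per_iter(cfg["block_size"], cfg["batch_size"])
    total = per_iter * MAX_ITERS
    print(f"{name}: {per_iter} tokens/iter, {total} tokens over {MAX_ITERS} iters")
```

Under these assumptions the script reproduces the 768, 1536, and 3072 tokens per iteration stated in the captions. Because block size and embedding size double together from one figure to the next, the three runs differ in both context length and model width.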