
Folded Context Condensation in Path Integral Formalism for Infinite Context Transformers

Won-Gi Paeng, Daesuk Kwon, Kyungwon Jeong, Honggyo Suh

TL;DR

The paper tackles the challenge of long-range dependency modeling in Transformers by introducing a Path Integral-based generalization where attention sums over all possible token-state trajectories and layers act as time-evolution operators. It presents Folded Context Condensation, a memory-augmented, segment-wise recurrence scheme with a phase-coherence mechanism that preserves historical context while achieving linear memory scaling. Key contributions include a formal mapping between Transformer components and Path Integral elements, a practical recurrence-based implementation with memory buffers across segments, and empirical validation on Passkey retrieval and long-document summarization showing competitive performance with substantially reduced memory usage. The work offers a quantum-inspired perspective that improves both efficiency and interpretability, enabling scalable processing of very long sequences while maintaining predictive accuracy.

Abstract

In this work, we present a generalized formulation of the Transformer algorithm by reinterpreting its core mechanisms within the framework of Path Integral formalism. In this perspective, the attention mechanism is recast as a process that integrates all possible transition paths leading to future token states, with temporal evolution governed by the Feed-Forward Network. By systematically mapping each component of the Transformer to its counterpart in the Path Integral formulation, we obtain a more compact and efficient representation, in which the contextual information of a sequence is condensed into memory-like segments. These segments are recurrently processed across Transformer layers, enabling more effective long-term information retention. We validate the effectiveness of this approach through the Passkey retrieval task and a summarization task, demonstrating that the proposed method preserves historical information while exhibiting memory usage that scales linearly with sequence length. This contrasts with the non-linear memory growth typically observed in standard attention mechanisms. We expect that this quantum-inspired generalization of the Transformer architecture will open new avenues for enhancing both the efficiency and expressiveness of future Transformer models.
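
To make the scaling claim concrete, the following toy count compares the number of attention-score entries for dense attention over a full sequence with a segment-wise scheme in which each segment attends to itself plus one memory buffer of the same length. This is only an illustrative sketch under stated assumptions, not the paper's measurement: the function names, the equal-size memory buffer, and the choice to count the first segment with a full buffer are all hypothetical simplifications.

```python
def full_attention_scores(n_tokens: int) -> int:
    """Entries in one dense n_tokens x n_tokens attention score matrix."""
    return n_tokens * n_tokens


def segmented_attention_scores(n_tokens: int, seg_len: int) -> tuple[int, int]:
    """Return (total, peak) score entries for segment-wise attention.

    Assumption (not from the paper): every segment of `seg_len` queries
    attends to `seg_len` memory keys plus its own `seg_len` keys, and
    n_tokens is a multiple of seg_len.
    """
    if n_tokens % seg_len != 0:
        raise ValueError("n_tokens must be a multiple of seg_len")
    n_segments = n_tokens // seg_len
    per_segment = seg_len * (2 * seg_len)  # queries x (memory + current keys)
    return n_segments * per_segment, per_segment


if __name__ == "__main__":
    for n in (4096, 8192, 16384):
        total, peak = segmented_attention_scores(n, seg_len=512)
        print(n, full_attention_scores(n), total, peak)
```

Under these assumptions, doubling the sequence length quadruples the dense count but only doubles the segment-wise total, while the per-segment peak stays fixed.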

Paper Structure

This paper contains 13 sections, 16 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Visualization of token state evolution in the Path Integral framework. The state $|x_i, t_0\rangle$ at time $t_0$ evolves through multiple interaction pathways before reaching $|x_i, t_f\rangle$ (left side). Each evolution from $t_j$ to $t_j + \Delta t$ can be described as transitions to all possible token states followed by the temporal evolution of the projection states (right side).
  • Figure 2: Underlying architecture of the model. A segment $|x^s,\, t_n^s \rangle$ from the previous layer is projected into a query and is also concatenated with the time-evolved memory buffer $|x^{s-1},\, t_n^{s-1}+\Delta t_n \rangle$ to form the key and value. The output of the current layer is saved as a memory buffer and retrieved when the next segment arrives.
  • Figure 3: The attention mask is divided into two regions: $K^{s-1}$ represents memory keys that allow full attention across all positions, while $K^{s}$ represents current keys, where attention is restricted with a causal masking pattern. $Q^s$ denotes the current query, interacting selectively with both $K^{s-1}$ and $K^{s}$.
  • Figure 4: The output of each layer for the input segment is preserved as a memory buffer. The memory buffer is then concatenated with the current segment and takes part in the attention.
  • Figure 5: Memory usage comparison as a function of total sequence length. The plot illustrates memory consumption (in GB) for Condensation and Llama models across batch sizes 1 and 4.
  • ...and 7 more figures
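
The mask layout described for Figure 3 and the buffer reuse described for Figures 2 and 4 can be sketched in a few lines of NumPy. This is a minimal single-head, single-layer illustration under loud assumptions, not the authors' implementation: the helper names (`build_mask`, `attend`, `run_segments`) are hypothetical, the query/key/value projections are replaced by the identity, and the time evolution of the memory buffer and the phase-coherence mechanism are omitted.

```python
import numpy as np


def build_mask(seg_len: int, mem_len: int) -> np.ndarray:
    """Boolean mask of shape (seg_len, mem_len + seg_len); True = may attend.

    Memory keys (K^{s-1}) are visible from every query position;
    current keys (K^{s}) follow a causal (lower-triangular) pattern.
    """
    memory_part = np.ones((seg_len, mem_len), dtype=bool)
    current_part = np.tril(np.ones((seg_len, seg_len), dtype=bool))
    return np.concatenate([memory_part, current_part], axis=1)


def attend(segment: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Toy attention: queries come from the segment, keys and values from
    [memory; segment]. Identity projections are an assumption for brevity."""
    keys = np.concatenate([memory, segment], axis=0)
    scores = segment @ keys.T / np.sqrt(segment.shape[-1])
    mask = build_mask(segment.shape[0], memory.shape[0])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys


def run_segments(tokens: np.ndarray, seg_len: int) -> np.ndarray:
    """Process a sequence segment by segment, carrying one memory buffer."""
    memory = np.zeros((0, tokens.shape[-1]))  # empty buffer before segment 0
    outputs = []
    for start in range(0, tokens.shape[0], seg_len):
        out = attend(tokens[start:start + seg_len], memory)
        outputs.append(out)
        memory = out  # layer output is saved and reused by the next segment
    return np.concatenate(outputs, axis=0)


if __name__ == "__main__":
    print(build_mask(seg_len=3, mem_len=2).astype(int))
    x = np.random.default_rng(0).normal(size=(8, 4))
    print(run_segments(x, seg_len=4).shape)
```

Because only the previous segment's output is retained in this sketch, the key/value width per step is bounded by the memory length plus the segment length, regardless of how many segments have been processed.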