Folded Context Condensation in Path Integral Formalism for Infinite Context Transformers
Won-Gi Paeng, Daesuk Kwon, Kyungwon Jeong, Honggyo Suh
TL;DR
The paper tackles the challenge of long-range dependency modeling in Transformers by introducing a Path Integral-based generalization where attention sums over all possible token-state trajectories and layers act as time-evolution operators. It presents Folded Context Condensation, a memory-augmented, segment-wise recurrence scheme with a phase-coherence mechanism that preserves historical context while achieving linear memory scaling. Key contributions include a formal mapping between Transformer components and Path Integral elements, a practical recurrence-based implementation with memory buffers across segments, and empirical validation on Passkey retrieval and long-document summarization showing competitive performance with substantially reduced memory usage. The work offers a quantum-inspired perspective that improves both efficiency and interpretability, enabling scalable processing of very long sequences while maintaining predictive accuracy.
Abstract
We present a generalized formulation of the Transformer algorithm by reinterpreting its core mechanisms within the framework of the Path Integral formalism. From this perspective, the attention mechanism is recast as a process that integrates over all possible transition paths leading to future token states, with temporal evolution governed by the Feed-Forward Network. By systematically mapping each component of the Transformer to its counterpart in the Path Integral formulation, we obtain a more compact and efficient representation in which the contextual information of a sequence is condensed into memory-like segments. These segments are processed recurrently across Transformer layers, enabling more effective long-term information retention. We validate the approach on the Passkey retrieval task and a summarization task, demonstrating that the proposed method preserves historical information while its memory usage scales linearly with sequence length. This contrasts with the non-linear memory growth typically observed in standard attention mechanisms. We expect that this quantum-inspired generalization of the Transformer architecture will open new avenues for enhancing both the efficiency and expressiveness of future Transformer models.
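To make the idea of segment-wise recurrence with a fixed-size memory concrete, the following is a minimal sketch and not the authors' implementation. It assumes single-head, non-causal dot-product attention, omits the Feed-Forward Network and the phase-coherence mechanism, and uses random slot queries as a stand-in for learned parameters. The names `process_in_segments`, `segment_len`, and `memory_slots` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    # Plain scaled dot-product attention (single head, no masking).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values


def process_in_segments(tokens, segment_len, memory_slots, seed=0):
    """Process `tokens` (n, d) segment by segment with a fixed-size memory.

    Assumption: the memory is refreshed by letting a fixed set of slot
    queries attend over [old memory; current segment]. This is only one
    possible condensation rule, used here for illustration.
    """
    d_model = tokens.shape[1]
    rng = np.random.default_rng(seed)
    slot_queries = rng.normal(size=(memory_slots, d_model))  # stand-in for learned queries
    memory = np.zeros((memory_slots, d_model))

    outputs = []
    peak_score_entries = 0  # largest attention-score matrix seen so far
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        # Each segment sees only the condensed memory plus itself.
        context = np.concatenate([memory, segment], axis=0)
        outputs.append(attend(segment, context, context))
        peak_score_entries = max(
            peak_score_entries, segment.shape[0] * context.shape[0]
        )
        # Condense the old memory and the current segment into the next memory.
        memory = attend(slot_queries, context, context)

    return np.concatenate(outputs, axis=0), memory, peak_score_entries


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    for n in (64, 512):
        x = rng.normal(size=(n, 16))
        out, mem, peak = process_in_segments(x, segment_len=32, memory_slots=4)
        print(n, out.shape, mem.shape, peak)
```

In this sketch the largest score matrix has `segment_len * (memory_slots + segment_len)` entries regardless of the total sequence length, so the per-segment working set stays constant and stored outputs grow linearly with the number of tokens. Full self-attention over the same sequence would instead build a score matrix with `n * n` entries.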
