PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding

Kangcong Li, Peng Ye, Chongjun Tu, Lin Zhang, Chunfeng Song, Jiamin Wu, Tao Yang, Qihao Zheng, Tao Chen

TL;DR

PaceLLM introduces brain-inspired mechanisms to address long-context challenges in LLMs by adding an Activation Memory Bank (AMB) that mimics persistent working-memory activity and a Cortical Expert clustering scheme that reorganizes FFN weights into semantically coherent modules. The method is training-free and compatible with existing architectures, achieving notable gains on long-context benchmarks and extending usable context to 200K tokens in NIAH. Key contributions include a detailed AMB retrieval/update strategy based on cosine-similarity memory lookups and a constrained KMeans-based FFN reorganization that preserves inference compatibility. The resulting approach improves coherence and cross-token dependencies without extensive retraining, offering a practical, generalizable path to stronger long-context understanding with interpretable internal structure.

Abstract

While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by transient neural activations, which cause information decay, and by unstructured feed-forward network (FFN) weights, which lead to semantic fragmentation. Inspired by the brain's working memory and cortical modularity, we propose PaceLLM, featuring two innovations: (1) a Persistent Activity (PA) Mechanism that mimics the persistent firing of prefrontal cortex (PFC) neurons by introducing an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, addressing contextual decay; and (2) Cortical Expert (CE) Clustering that emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules, establishing cross-token dependencies and mitigating fragmentation. Extensive evaluations show that PaceLLM achieves a 6% improvement on LongBench's Multi-document QA and 12.5-17.5% performance gains on Infinite-Bench tasks, while extending measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. This work pioneers brain-inspired LLM optimization and is complementary to other approaches. It can also be generalized to any model, enhancing long-context performance and interpretability without structural overhauls.
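
The claim that FFN weights can be reorganized into modules without structural overhauls relies on a general property of feed-forward layers: reordering the hidden neurons, with the same permutation applied to both weight matrices, leaves the layer's output unchanged. The sketch below illustrates only that property, under stated assumptions. The function names `ffn` and `regroup_neurons` are hypothetical, the FFN is a plain two-matrix ReLU layer rather than the gated variant used by many LLMs, and the `labels` array is a random stand-in for a clustering result, not the paper's constrained KMeans.

```python
import numpy as np

def ffn(x, w_up, w_down):
    """Plain two-layer FFN with ReLU: relu(x @ w_up) @ w_down."""
    return np.maximum(x @ w_up, 0.0) @ w_down

def regroup_neurons(w_up, w_down, labels):
    """Reorder hidden neurons so each cluster occupies a contiguous block.

    labels[i] is the (hypothetical) cluster id of hidden neuron i.
    The same permutation is applied to the columns of w_up and the
    rows of w_down, so the function computed by the layer is unchanged.
    """
    order = np.argsort(labels, kind="stable")
    return w_up[:, order], w_down[order, :], order

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
x = rng.standard_normal((3, d_model))

# Assumed stand-in for a clustering result: equal-sized clusters, shuffled.
labels = rng.permutation(np.repeat(np.arange(n_experts), d_ff // n_experts))

w_up_g, w_down_g, order = regroup_neurons(w_up, w_down, labels)

# The regrouped layer matches the original up to floating-point rounding.
same_output = np.allclose(ffn(x, w_up, w_down), ffn(x, w_up_g, w_down_g))
```

For a gated FFN, the same argument holds as long as the gate and up projections are permuted identically along the hidden dimension. How the cluster labels are obtained is the substantive part of the method and is not reproduced here.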

Paper Structure

This paper contains 40 sections, 16 equations, 4 figures, 12 tables, 2 algorithms.

Figures (4)

  • Figure 1: Schematic diagram of PaceLLM (bottom) and its neuroscience counterpart (top). In this example, which introduces James Chadwick, the brain processes and retains key information through working memory. When content held in working memory, such as "Britain", reappears in the subsequent text, the relevant neurons are persistently reactivated. When the final question is input, the neuron associated with the keyword "neutron" is likewise reactivated, connects with other relevant neurons, and finally yields the answer "Manhattan Project". Analogously, PaceLLM clusters FFN weights into experts and introduces an Activation Memory Bank (AMB) that interacts with activations.
  • Figure 2: Illustration of PaceLLM. Left: the overall pipeline; note that the Activation Memory Bank (AMB) does not interact with every FFN layer. Top right: a detailed view of the modified FFN layer. Bottom right: the detailed processing flow of the AMB. ① Lookup Memory shows similarity retrieval, selection of the top-$k$ entries, and the addition of noise. ② shows how a reuse strategy is selected by comparing the similarity with a threshold. ③ shows the three strategies for updating the AMB.
  • Figure 3: Evaluation on Needle-In-A-Haystack. PaceLLM (bottom) retrieves the needle at context lengths up to 200K tokens, whereas Activation Beacon (top) reaches 128K.
  • Figure 4: Visualization of current and historical activations. The orange circles mark clusters in which current and past activations lie close together, indicating that they carry similar information and that useful past activations are reused. This illustrates that PaceLLM leverages the AMB to retrieve semantically similar past activations, enabling repeated reuse in a manner analogous to working memory.
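
The AMB flow described in the Figure 2 caption (similarity lookup, top-$k$ selection, optional noise, a threshold test that decides whether to reuse, then an update) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the paper's implementation. The function `amb_step`, its parameters (`top_k`, `threshold`, `alpha`, `capacity`, `noise_std`), the linear blending rule, and the first-in-first-out update are all hypothetical stand-ins; the paper's actual reuse and update strategies are not given in this summary.

```python
import numpy as np

def cosine_sim(query, keys):
    """Cosine similarity between one query vector and each row of keys."""
    q = query / (np.linalg.norm(query) + 1e-8)
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    return k @ q

def amb_step(bank, act, top_k=1, threshold=0.8, alpha=0.5,
             capacity=4, noise_std=0.0, rng=None):
    """One lookup / reuse / update step on a list-based memory bank.

    All hyperparameters and the blending rule are illustrative assumptions.
    Returns the (possibly blended) activation and whether memory was reused.
    """
    out, reused = act, False
    if bank:
        keys = np.stack(bank)
        sims = cosine_sim(act, keys)
        idx = np.argsort(sims)[::-1][:top_k]      # most similar entries first
        retrieved = keys[idx].mean(axis=0)
        if noise_std > 0:                          # optional noise on the lookup
            if rng is None:
                rng = np.random.default_rng()
            retrieved = retrieved + rng.normal(0.0, noise_std, retrieved.shape)
        if sims[idx[0]] >= threshold:              # reuse only if similar enough
            out = alpha * act + (1.0 - alpha) * retrieved
            reused = True
    bank.append(act.copy())                        # assumed FIFO update
    if len(bank) > capacity:
        bank.pop(0)
    return out, reused

bank = []
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
out1, reused1 = amb_step(bank, a)                  # empty bank: nothing to reuse
out2, reused2 = amb_step(bank, b)                  # orthogonal to a: below threshold
out3, reused3 = amb_step(bank, 0.9 * a + 0.1 * b)  # close to a: blended with it
```

In this toy run only the third activation is similar enough to a stored entry to trigger reuse; the first two pass through unchanged and are simply written to the bank.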