Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories

Sidi Lu; Zhenwen Liang; Dongyang Ma; Yan Wang; Haitao Mi; Dong Yu

TL;DR

This work addresses the inefficiency and forgetting problems of test-time adaptation by introducing Locas, a locally-supported parametric memory that complements transformer FFNs and is initialized in a principled way. It presents two variants, Locas-MLP and Locas-GLU, which write new key–value pairs into a sideway FFN to expand capacity without altering backbone parameters. Both rely on activation-based initialization; Locas-GLU additionally uses activation-guided parameter cloning and a zero-initialized value matrix. The authors also propose a compression approach (NL-SVD) for Locas-MLP, though empirical results favor standard backpropagation for memory updates on cost grounds. Experiments on PG-19 and LoCoMo show strong parameter and compute efficiency: Locas-GLU performs competitively with or better than TempLoRA while adding far fewer parameters and less computational overhead, and it shows minimal catastrophic forgetting of general capabilities as measured by MMLU. Overall, Locas enables rapid, memory-augmented continual learning within the model, with practical benefits for long-context modeling and domain adaptation.

Abstract

In this paper, we aim to bridge test-time training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one uses a conventional two-layer MLP design and has a clearer theoretical guarantee; the other shares the GLU-FFN structure of SOTA LLMs and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories -- performed in a principled way by reusing model parameters, activations and/or gradients -- is essential for fast convergence, improved generalization, and prevention of catastrophic forgetting. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. In addition, we test the model's loss of general capability after memorizing a whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimal catastrophic forgetting of the model's existing internal knowledge.
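To make the initialization idea concrete, the following is a minimal NumPy sketch of one way an activation-guided, parameter-cloning initialization for a GLU-style side memory could look. It is an illustration under stated assumptions, not the authors' implementation: the function name `init_glu_memory`, the mean-absolute-activation scoring rule, the SiLU gate, and the matrix shapes are all hypothetical choices made for this example.

```python
import numpy as np


def silu(x):
    """SiLU activation, assumed here as the GLU gate nonlinearity."""
    return x / (1.0 + np.exp(-x))


def init_glu_memory(W_gate, W_up, X, k):
    """Hypothetical activation-guided initialization of a GLU-style memory.

    W_gate, W_up: (d_ff, d_model) backbone gate and key matrices.
    X:            (n_tokens, d_model) hidden states of the context to store.
    k:            number of memory slots to create.
    """
    # Backbone GLU hidden activations for the context tokens.
    H = silu(X @ W_gate.T) * (X @ W_up.T)            # (n_tokens, d_ff)
    # Assumed scoring rule: mean absolute activation per hidden unit.
    scores = np.abs(H).mean(axis=0)                  # (d_ff,)
    # Indices of the k highest-scoring units, best first.
    top = np.argsort(scores)[-k:][::-1]
    # Clone the selected gate and key rows from the backbone.
    M_gate = W_gate[top].copy()
    M_up = W_up[top].copy()
    # Zero-valued value matrix: the memory contributes nothing at first.
    M_down = np.zeros((W_gate.shape[1], k))          # (d_model, k)
    return top, M_gate, M_up, M_down


rng = np.random.default_rng(0)
W_gate = rng.normal(size=(16, 8))
W_up = rng.normal(size=(16, 8))
X = rng.normal(size=(32, 8))
top, M_gate, M_up, M_down = init_glu_memory(W_gate, W_up, X, k=4)
```

Because the value matrix starts at zero, attaching such a memory leaves the model's outputs unchanged until training writes into it, while the cloned gate and key rows already respond to the context being memorized.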

Paper Structure (50 sections, 19 equations, 2 figures, 5 tables, 1 algorithm)

Introduction
Methodology
Notation: FFNs are Soft Look-up Table Memories
Two Variants: Locas-MLP and Locas-GLU
Locas-MLP
Locas-GLU
Memory Initialization for Locas-MLP: Activation and Gradient Reusage Yields Step-wise Optimal Initialization
Memory Initialization for Locas-GLU: Activation-Guided Parameter Cloning
Activation-Based Basis Selection
Top-$K$ Selection as Nonlinear PCA in Activation Space
Parameter Cloning for Key and Gate Matrices
Zero Initialization for Value Matrix
Interpretation
Weight Norm Clipping for Implicit KL Constraint
Output Scaling Factor
...and 35 more sections

Figures (2)

Figure 1: Illustration of a typical dense transformer layer with FFN interpreted as a soft look-up table memory, in comparison with the attention mechanism, which is a contextual soft look-up table mechanism. The GLU variant follows a similar structure but with an additional gating mechanism.
Figure 2: Architecture of the proposed Locas parametric memory integrated as a sideway FFN module in transformer layers. The memory module operates in parallel with the backbone FFN, with its output scaled and added to the main pathway. This design enables genuine model capacity expansion at test time while preserving the backbone model's pretrained representations.
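The parallel arrangement described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a hedged illustration only: the function names `glu_ffn` and `ffn_with_memory`, the SiLU gate, and the parameter name `scale` for the output scaling factor are assumptions for this example, and attention, normalization, and the residual connection are omitted.

```python
import numpy as np


def silu(x):
    """SiLU activation, assumed here as the GLU gate nonlinearity."""
    return x / (1.0 + np.exp(-x))


def glu_ffn(x, W_gate, W_up, W_down):
    """GLU feed-forward block viewed as a soft look-up table.

    Rows of W_gate / W_up act as keys, columns of W_down as values.
    Shapes: W_gate, W_up are (d_hidden, d_model); W_down is (d_model, d_hidden).
    """
    return (silu(x @ W_gate.T) * (x @ W_up.T)) @ W_down.T


def ffn_with_memory(x, backbone, memory, scale=1.0):
    """Backbone FFN plus a scaled sideway memory running in parallel."""
    return glu_ffn(x, *backbone) + scale * glu_ffn(x, *memory)


rng = np.random.default_rng(0)
d_model, d_ff, k = 8, 16, 4
backbone = (rng.normal(size=(d_ff, d_model)),
            rng.normal(size=(d_ff, d_model)),
            rng.normal(size=(d_model, d_ff)))
# Memory with a zero value matrix, so it is a no-op before any update.
memory = [rng.normal(size=(k, d_model)),
          rng.normal(size=(k, d_model)),
          np.zeros((d_model, k))]
x = rng.normal(size=(5, d_model))

out_before = ffn_with_memory(x, backbone, memory, scale=0.5)
# Simulate a memory write by changing only the memory's value matrix.
memory[2] = rng.normal(size=(d_model, k))
out_after = ffn_with_memory(x, backbone, memory, scale=0.5)
```

In this sketch only the memory's matrices change between the two calls, which mirrors the caption's point that capacity is added at test time while the backbone's pretrained weights stay untouched.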
