EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices

Jiyu Chen, Shuang Peng, Daxiong Luo, Fan Yang, Renshou Wu, Fangyuan Li, Xiaoxin Chen

TL;DR

EdgeInfinite addresses the challenge of long-context processing on edge devices by integrating compressed memory into Transformer LLMs through a trainable memory-gating module, while preserving compatibility with standard Transformer architectures. It rests on three architectural components: segmented attention with RoPE for local context, a memory compression-decompression mechanism that stores past context, and a memory-gating module that adaptively blends memory-based and local attention, with fine-tuning limited to a small subset of parameters. The inference strategy further limits quality loss by preserving sink and window tokens and by dynamically routing between compressed memory for long contexts and the standard KV cache for short contexts. Experimental results on LongBench with a BlueLM-3B backbone show EdgeInfinite achieving performance competitive with or superior to baseline KV-cache methods, and even to FullKV, while markedly reducing memory usage and time to first token for long sequences, making it well suited to edge deployment. Overall, EdgeInfinite provides a practical, Transformer-compatible path to scalable infinite-context inference on resource-constrained devices.
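To make the gating idea concrete, here is a minimal sketch of how a learned gate could blend a memory-based attention output with a local attention output. The summary does not give the paper's gating formula, so the sigmoid gate, the per-dimension blend, and the names `blend_attention`, `a_mem`, `a_local`, and `gate_logits` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, used here as the (assumed) gate activation."""
    return 1.0 / (1.0 + math.exp(-x))


def blend_attention(a_mem, a_local, gate_logits):
    """Blend two attention outputs per dimension: g * a_mem + (1 - g) * a_local.

    a_mem, a_local, and gate_logits are equal-length lists of floats.
    All three names are hypothetical; the gate form is an assumption.
    """
    if not (len(a_mem) == len(a_local) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    blended = []
    for m, l, z in zip(a_mem, a_local, gate_logits):
        g = sigmoid(z)  # g near 1 favours memory, g near 0 favours local attention
        blended.append(g * m + (1.0 - g) * l)
    return blended
```

With a gate logit of zero the two outputs are averaged; a strongly positive logit passes the memory-based output through almost unchanged, and a strongly negative one passes the local output through.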

Abstract

Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and the growing memory demands of the Key-Value (KV) cache. Existing KV cache optimizations struggle with irreversible token eviction in long-output tasks, while alternative sequence modeling architectures prove costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach maintains full compatibility with standard Transformer architectures, requires fine-tuning only a small subset of parameters, and enables selective activation of the memory-gating module to route between long- and short-context tasks. Experimental results show that EdgeInfinite achieves performance comparable to the baseline Transformer-based LLM on long-context benchmarks while reducing memory consumption and time to first token.
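The routing and token-preservation ideas can be sketched as follows. This is an illustrative sketch only: the length threshold, the function names `route` and `retained_positions`, and the parameters `num_sink` and `window` are hypothetical and do not come from the paper. The sketch assumes that "sink" tokens are the first few positions of the sequence and "window" tokens are the most recent ones.

```python
def route(seq_len: int, threshold: int) -> str:
    """Pick an inference path from the input length (threshold is an assumed knob).

    Short inputs use the ordinary KV cache; long inputs activate the
    memory-gating path with compressed memory.
    """
    return "memory" if seq_len > threshold else "kv_cache"


def retained_positions(seq_len: int, num_sink: int, window: int) -> list:
    """Positions whose KV entries are kept uncompressed.

    Keeps the first `num_sink` positions (sink tokens) and the last
    `window` positions (window tokens), without duplicating any index
    when the two ranges overlap.
    """
    sink_end = min(num_sink, seq_len)
    window_start = max(seq_len - window, sink_end)
    return list(range(sink_end)) + list(range(window_start, seq_len))
```

For example, `retained_positions(10, 2, 3)` keeps positions 0 and 1 as sink tokens and positions 7 to 9 as window tokens; the remaining positions are the ones a compressed memory would have to summarize.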

Paper Structure

This paper contains 16 sections, 10 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: The overall framework of EdgeInfinite, illustrating the computation process of the attention layer in Transformer-based LLMs, with LLaMA attention (Touvron et al., 2023; Grattafiori et al., 2024) as an example.
  • Figure 2: The inference strategy of EdgeInfinite.
  • Figure 3: Efficiency of EdgeInfinite. GPU memory consumption and time to first token (TTFT) are shown for varying input sequence lengths.