GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

Daniel Goldstein; Fares Obeid; Eric Alcaide; Guangyu Song; Eugene Cheah

TL;DR

GoldFinch tackles the long-context bottleneck of attention-based models with a hybrid RNN-attention architecture that compresses the KV-Cache to as little as $\frac{d_{model}}{16}$ per token. Autoregressive decoding remains $O(N)$ per token because of attention, while pre-fill costs only $O(1)$ per token (linear in the total context length) because an RNN generates the global cache. The method uses Finch-C2 time mixing in the initial layers and GOLD transformer blocks on top, enabled by TokenCat decompression of a global cached key stream, resulting in a significantly smaller KV-Cache and improved downstream performance. Key findings show GoldFinch achieving lower final losses than Finch and Llama on 1.5B-parameter-class models, perfect MQAR recall, and strong long-context extrapolation when combined with RoPE. Trained weights and training code are released under the Apache 2.0 license. The work demonstrates practical benefits for extremely long-context language modeling on limited hardware and opens pathways for further memory reductions via quantization and alternate linear-attention backbones.

Abstract

We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train up to 1.5B parameter class models of the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, ranging from 756-2550 times smaller than the traditional transformer cache for common sizes, enabling inference of extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state for a submitted context costs only O(1) time per token due to the use of a recurrent neural network (RNN) to generate this cache. We release our trained weights and training code under the Apache 2.0 license for community use.
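
As a rough sanity check on the cache-size claim, the sketch below compares per-token cache sizes, assuming a traditional transformer stores one key and one value vector of width $d_{model}$ in every layer, while the compressed cache stores a single shared entry of width $d_{model}/16$ per token (the TL;DR figure). The small `index_overhead` term (an allowance for storing the token index) and the two model shapes are illustrative assumptions, not values stated on this page; they happen to land on the 756-2550 range quoted in the abstract.

```python
def cache_ratio(n_layers: int, d_model: int, index_overhead: float = 2.0) -> float:
    """Traditional per-token KV-Cache size divided by the compressed size."""
    # Traditional transformer: a key and a value vector of width d_model per layer.
    traditional = 2 * n_layers * d_model
    # Assumed compressed cache: one shared d_model/16 entry per token,
    # plus a small allowance for the stored token index (assumption).
    compressed = d_model / 16 + index_overhead
    return traditional / compressed


# Hypothetical model shapes, chosen only for illustration.
for n_layers, d_model in [(24, 2048), (80, 8192)]:
    print(n_layers, d_model, round(cache_ratio(n_layers, d_model), 1))
# 24 2048 756.2
# 80 8192 2550.0
```

Without the overhead term the ratio is exactly $32 \cdot L$ for $L$ layers, which is why the savings grow linearly with layer count.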

Paper Structure (26 sections, 11 equations, 4 figures, 4 tables)

Introduction
Background
Other Concurrent Related Work
Method
Finch-C2 Time Mixing
GOLD Key Compression
GOLD Key Decompression (TokenCat)
GOLD Attention Time Mixing
GoldFinch Channel Mixing (same as Finch Channel Mixing)
GPTAlpha Time Mixing
Experiments
Architecture Comparisons
Ablation Studies
Associative Recall
Long Context Experiments
...and 11 more sections

Figures (4)

Figure 1: GoldFinch Architecture Block Diagram
Figure 2: Loss curves of 1.5B class models.
Figure 3: MQAR tasks. An increase in sequence length correlates with increased task difficulty.
Figure 4: Finch and GoldFinch on the same MQAR task with increased sequence length
