Cross-attention for State-based model RWKV-7

Liu Xiao, Li Zhiyuan, Lin Yueyu

TL;DR

The work tackles the inefficiency of Transformer cross-attention in text-to-image generation by leveraging the linear-complexity RWKV-7 architecture. It introduces CrossWKV, a cross-modal module that uses LoRA and group normalization to perform global cross-attention in a single pass with linear complexity $O(T \cdot N)$ and constant memory. Integrated into Diffusion in RWKV-7 (DIR-7), the model achieves competitive Fréchet Inception Distance (FID) and CLIP scores on ImageNet 256×256 and demonstrates robustness across multilingual prompts and high-resolution generation, while maintaining scalable performance on resource-limited devices. This approach offers a practical, scalable path to high-quality cross-modal generation and state manipulation, with potential extensions to edge devices and complex reasoning tasks.

Abstract

We introduce CrossWKV, a novel cross-attention mechanism for the state-based RWKV-7 model, designed to enhance the expressive power of text-to-image generation. Leveraging RWKV-7's linear-complexity Weighted Key-Value (WKV) architecture, CrossWKV integrates text and image modalities in a single pass, utilizing a generalized delta rule with vector-valued gating and low-rank adaptations (LoRA) to achieve superior cross-modal alignment. Unlike Transformer-based models, CrossWKV's non-diagonal, input-dependent transition matrix enables it to represent complex functions beyond the $\mathrm{TC}^0$ complexity class, including all regular languages, as demonstrated by its ability to perform state-tracking tasks like $S_5$ permutation modeling. Evaluated within the Diffusion in RWKV-7 (DIR-7) on datasets such as LAION-5B and ImageNet, CrossWKV achieves a Fréchet Inception Distance (FID) of 2.88 and a CLIP score of 0.33 on ImageNet 256×256, matching state-of-the-art performance while offering robust generalization across diverse prompts. The model's enhanced expressivity, combined with constant memory usage and linear scaling, positions it as a powerful solution for advanced cross-modal tasks, with potential applications in high-resolution generation and dynamic state manipulation. Code is available at https://github.com/TorchRWKV/flash-linear-attention
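
To make the "delta rule with vector-valued gating" and "constant memory" claims concrete, the following is a minimal sketch of a simplified delta-rule state update, not the paper's actual CrossWKV formulation. It assumes keys and values share the same width N, and the names `fold_context` and `read_state` are hypothetical. Context tokens (for example, text) are folded one at a time into an N×N state whose size does not depend on sequence length, and a query (for example, an image token) then reads from that state.

```python
import math

def fold_context(keys, values, decays, betas):
    """Fold (key, value) pairs into a fixed-size N x N state (illustrative only).

    Per token: S <- S (diag(decay) - beta * k k^T) + v k^T, with k unit-normalized.
    decay is a per-channel vector gate; beta is a scalar in-context learning rate.
    """
    n = len(keys[0])
    state = [[0.0] * n for _ in range(n)]  # memory is O(N^2), independent of T
    for key, value, decay, beta in zip(keys, values, decays, betas):
        norm = math.sqrt(sum(x * x for x in key))
        k = [x / norm for x in key]
        # What the state currently returns for this key: S k
        pred = [sum(row[j] * k[j] for j in range(n)) for row in state]
        # Expanded form of the update above: decay columns, then a rank-1 correction
        state = [
            [state[i][j] * decay[j] + (value[i] - beta * pred[i]) * k[j]
             for j in range(n)]
            for i in range(n)
        ]
    return state

def read_state(state, query):
    """Read from the state with a query vector: S q."""
    return [sum(row[j] * query[j] for j in range(len(query))) for row in state]

# Usage: the third token reuses the first key, so its value replaces the old one.
no_decay = [1.0, 1.0]
state = fold_context(
    keys=[[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]],
    values=[[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
    decays=[no_decay] * 3,
    betas=[1.0] * 3,
)
print(read_state(state, [1.0, 0.0]))  # [7.0, 8.0]
```

In this sketch the transition `diag(decay) - beta * k k^T` is non-diagonal and depends on the input, which is the property the abstract credits for the added expressivity. Each token costs O(N²) work here, so total cost grows linearly with the number of context tokens.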

Paper Structure

This paper contains 21 sections, 17 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: the CrossWKV mechanism