Comba: Improving Bilinear RNNs with Closed-loop Control

Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, Weigao Sun

TL;DR

Comba introduces a closed-loop Bilinear RNN with scalar-plus-low-rank state transitions and output correction, blending control theory with neural memory to achieve robust, hardware-friendly chunk-wise parallel training. By leveraging WY representations and UT transforms, Comba attains faster pretraining and improved performance on both language and vision tasks across 340M and 1.3B parameter scales. The approach addresses limitations of prior Delta-based Bilinear RNNs by enabling principled memory forgetting, improved recall, and stable long-context modeling, while maintaining compatibility with hybrid architectures. Limitations include evaluation at moderate scales and partial comparisons with newer nonlinear RNNs; future work targets larger-scale benchmarking and deeper integration with hybrid attention mechanisms like GSA.

Abstract

Recent efficient sequence modeling methods such as Gated DeltaNet, TTT, and RWKV-7 have achieved performance improvements by supervising recurrent memory management through the Delta learning rule. Unlike previous state-space models (e.g., Mamba) and gated linear attentions (e.g., GLA), these models introduce interactions between the recurrent state and the key vector, structurally resembling bilinear systems. In this paper, we first introduce the concept of Bilinear RNNs with a comprehensive analysis of the advantages and limitations of these models. Then, based on closed-loop control theory, we propose a novel Bilinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition with both state feedback and output feedback corrections. We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on a large-scale corpus. Comba demonstrates superior performance and computational efficiency in both language and vision modeling.
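To make the phrase "scalar-plus-low-rank state transition with state feedback and output feedback" more concrete, here is a minimal NumPy sketch of one Delta-rule-style recurrent step. The parameterization is an assumption made for illustration only: the names `alpha`, `beta`, `b`, and `d`, the key normalization, and the exact form of the output correction are not given in the abstract and may differ from the paper's formulation. The sketch is a plain sequential step, not the chunk-wise parallel Triton kernel the abstract mentions.

```python
import numpy as np

def bilinear_rnn_step(S, q, k, v, alpha, beta, b=1.0, d=0.0):
    """One recurrent step of a Delta-rule-style bilinear RNN (illustrative).

    All parameter names are hypothetical, chosen for this sketch:
      S     : (d_v, d_k) memory state
      q, k  : (d_k,) query and key; v : (d_v,) value
      alpha : scalar decay applied to the whole state
      beta  : write strength for the new key/value pair
      b     : scale of the state-feedback (erase) term
      d     : strength of the output-feedback correction on the query
    """
    # Normalize the key so the rank-1 term has a bounded spectrum.
    k = k / (np.linalg.norm(k) + 1e-6)
    # Scalar-plus-low-rank transition: alpha * I - b * beta * k k^T.
    A = alpha * np.eye(k.shape[0]) - b * beta * np.outer(k, k)
    # State feedback: the state interacts with the key (bilinear term),
    # then the new association v k^T is written.
    S = S @ A + beta * np.outer(v, k)
    # Output feedback (assumed form): correct the query with the key.
    o = S @ (q - d * k)
    return S, o
```

With `alpha = 1`, `beta = 1`, `b = 1`, and `d = 0`, this reduces to the plain Delta rule: writing a second value under the same key overwrites the first instead of accumulating on top of it.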

Paper Structure

This paper contains 31 sections, 14 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Householder transform as mirror transform with factor $\beta$.
  • Figure 2: Comba Families. The Mamba-like architecture omits MLP layers, uses multi-value attention, and doubles the model depth. For the hybrid model, we incorporate sliding window attention in flexible proportions to boost the model's recall ability. The window size is set to the context length, equivalent to softmax attention.
  • Figure 3: Operator speed (fwd and bwd) evaluated with Triton-Testing-Benchmark [tillet2019triton] on a single A800-80G GPU.
  • Figure 4: Training loss on 8$\times$ A800 GPUs with logging 32.
  • Figure 5: Results on the synthetic MQAR task with the settings of [arora2023zoology].
  • ...and 1 more figure
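The caption of Figure 1 describes the Householder transform as a mirror transform with factor $\beta$. As a small worked illustration, the sketch below builds the matrix $I - \beta k k^\top$ for a unit-norm $k$. The function name `generalized_householder` is hypothetical, and treating $\beta$ as a free scalar is an assumption based on the caption rather than on the paper's text.

```python
import numpy as np

def generalized_householder(k, beta):
    """Return I - beta * k k^T for the unit-normalized direction k.

    beta = 0 leaves every vector unchanged, beta = 1 projects out the
    component along k, and beta = 2 reflects across the hyperplane
    orthogonal to k (the classical Householder mirror).
    """
    k = k / np.linalg.norm(k)
    return np.eye(k.shape[0]) - beta * np.outer(k, k)
```

For example, with `k = np.array([3.0, 4.0])`, the matrix for `beta = 2` maps `k` to `-k`, and the matrix for `beta = 1` maps `k` to the zero vector, while vectors orthogonal to `k` are left unchanged in both cases.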