Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling

Mahdi Karami, Ali Ghodsi

TL;DR

This paper tackles the quadratic complexity of self-attention by introducing Orchid, a data-dependent global convolution whose kernel adapts to the input via conditioning networks. By generating a long, input-conditioned kernel and convolving in the frequency domain, Orchid achieves quasilinear complexity $O(M L \log L)$ while preserving the ability to model long-range dependencies, aided by shift-equivariant conditioning and gating. The approach is validated across synthetic in-context learning, language modeling, and image classification, with Orchid-based BERT and ViT variants achieving competitive or superior accuracy using fewer parameters and longer effective context than dense attention baselines. These results suggest that data-dependent global convolutions can provide a scalable, expressive alternative to attention in foundation models, with broad applicability beyond NLP to vision and other domains; code is openly available to facilitate adoption.

Abstract

In the rapidly evolving field of deep learning, the demand for models that are both expressive and computationally efficient has never been more critical. This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms without compromising the ability to capture long-range dependencies and perform in-context learning. At the core of this architecture lies a new data-dependent global convolution layer, which adapts its kernel to the input sequence using a dedicated conditioning neural network. We design two simple conditioning networks that maintain shift equivariance in our data-dependent convolution operation. The dynamic nature of the proposed convolution kernel grants Orchid high expressivity while maintaining quasilinear scalability for long sequences. We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality. Our experiments demonstrate that this architecture not only outperforms traditional attention-based architectures such as BERT and Vision Transformers with smaller model sizes, but also extends the feasible sequence length beyond the limitations of dense attention layers. This achievement represents a significant step towards more efficient and scalable deep learning models for sequence modeling. The code is available at https://github.com/Karami-m/orchid.
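The core idea of a data-dependent global convolution that stays shift-equivariant can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names are hypothetical, and the conditioning term used here (the magnitude of the input spectrum) is only an assumed stand-in for the paper's conditioning networks, chosen because a magnitude spectrum does not change under circular shifts of the input.

```python
import numpy as np

def toy_kernel_freq(x, h0):
    """Frequency-domain kernel: static part plus an input-dependent part.

    Assumption for illustration: the input-dependent term is |FFT(x)|,
    which is unchanged by circular shifts of x.
    """
    h0_freq = np.fft.rfft(h0)             # static kernel, moved to frequency domain
    h_theta_freq = np.abs(np.fft.rfft(x))  # shift-invariant, data-dependent term
    return h0_freq + h_theta_freq

def toy_data_dependent_conv(x, h0):
    """Global circular convolution of x with its own conditioned kernel.

    Computed as a pointwise product in the frequency domain, so the cost
    is dominated by the FFTs rather than by an L x L interaction.
    """
    L = x.shape[-1]
    h_freq = toy_kernel_freq(x, h0)
    return np.fft.irfft(np.fft.rfft(x) * h_freq, n=L)

rng = np.random.default_rng(0)
L = 16
x = rng.standard_normal(L)
h0 = rng.standard_normal(L)

y = toy_data_dependent_conv(x, h0)
y_from_shifted_input = toy_data_dependent_conv(np.roll(x, 3), h0)

# Shifting the input by 3 positions shifts the output by 3 positions.
shift_equivariant = np.allclose(np.roll(y, 3), y_from_shifted_input)
print("shift equivariant:", shift_equivariant)
```

Because the kernel depends on the input only through a shift-invariant quantity, shifting the input leaves the kernel unchanged and simply shifts the output, which is the property the abstract attributes to the conditioning networks.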

Paper Structure

This paper contains 33 sections, 13 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 2.1: Orchid block architecture. This diagram illustrates the structure of the Orchid block. The core operation is a convolution (denoted by $*$), implemented efficiently in the frequency domain using the FFT. Element-wise multiplication is denoted by $\odot$. On the right side, the two conditioning networks introduced in equations (\ref{eq:adaconv1}) and (\ref{eq:adaconv2}), which produce shift-invariant convolution kernels, are illustrated. Because the convolution is performed in the spectral domain, the kernel is computed directly in the frequency domain as $h^{\mathcal{F}} = h_0^{\mathcal{F}} + h^{\mathcal{F}}_{\theta}({\bm{x}})$. The block also begins with MLPs for linear projection and pointwise mixing of features, a common design choice in sequence modeling architectures.
  • Figure 4.1: Test accuracy of in-context learning on the associative recall task with different sequence lengths and a vocabulary size of 20. The results for the baseline models are drawn from Poli et al. (2023) and Fu et al. (2023). The symbol ✘ indicates that the Transformer model either failed to complete the task within a week or did not fit in memory.
  • Figure C.1: Comparison of local Conv1D choices: evaluation of different local convolution options used in the conditioning network. The variants compared are the Type I conditioning network (Equation \ref{eq:adaconv1}) with 1 Conv1D layer in time + 1 layer in frequency, 2 Conv1D layers in time, 2 Conv1D layers in frequency, and 3 Conv1D layers in time + 3 layers in frequency.
  • Figure C.2: Comparison of different nonlinearities $\sigma(\cdot)$ in the Type II conditioning network (cross-correlation in Equation \ref{eq:adaconv2_nonlinearity}).
  • Figure C.3: Test accuracy of in-context learning on the associative recall task with a vocabulary size of 20 and sequence length of 128, comparing different model components. Type I refers to the Type I conditioning network (based on the absolute value in Equation \ref{eq:adaconv1}). Orthonormal indicates transforms that use orthogonal, normalized bases.
  • ...and 1 more figure