Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
Mahdi Karami, Ali Ghodsi
TL;DR
This paper tackles the quadratic complexity of self-attention by introducing Orchid, a data-dependent global convolution whose kernel adapts to the input via conditioning networks. By generating a long, input-conditioned kernel and convolving in the frequency domain, Orchid achieves quasilinear complexity $O(M L \log L)$ while preserving the ability to model long-range dependencies, aided by shift-equivariant conditioning and gating. The approach is validated on synthetic in-context learning, language modeling, and image classification, where Orchid-based BERT and ViT variants achieve competitive or superior accuracy with fewer parameters and a longer effective context than dense attention baselines. These results suggest that data-dependent global convolutions can provide a scalable, expressive alternative to attention in foundation models, applicable to vision and other domains as well as NLP. The code is openly available.
Abstract
In the rapidly evolving field of deep learning, the demand for models that are both expressive and computationally efficient has never been more critical. This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms without compromising the ability to capture long-range dependencies and to learn in context. At the core of this architecture lies a new data-dependent global convolution layer, which adapts its kernel to the input sequence using a dedicated conditioning neural network. We design two simple conditioning networks that maintain shift equivariance in our data-dependent convolution operation. The dynamic nature of the proposed convolution kernel grants Orchid high expressivity while maintaining quasilinear scalability for long sequences. We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality. Our experiments demonstrate that this architecture not only outperforms traditional attention-based architectures such as BERT and Vision Transformers with smaller model sizes, but also extends the feasible sequence length beyond the limitations of dense attention layers. This achievement represents a significant step towards more efficient and scalable deep learning models for sequence modeling. The code is available at https://github.com/Karami-m/orchid.
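To make the idea of a data-dependent, shift-equivariant global convolution concrete, the following is a minimal NumPy sketch. It is not the Orchid implementation. The function names, the short filter `w`, and the choice of conditioning (taking the magnitude of the Fourier transform of a locally filtered input) are assumptions made for illustration. The sketch computes the convolution in the frequency domain with the FFT, which costs $O(L \log L)$ per channel, and relies on the fact that the magnitude of a discrete Fourier transform is unchanged by circular shifts of the signal.

```python
import numpy as np

def conditioned_kernel_freq(x, w):
    """Hypothetical conditioning step (an illustrative assumption, not the
    paper's network): build the kernel's frequency response from the input.

    x: array of shape (..., L), the input sequence(s).
    w: short 1-D filter, standing in for learned parameters.
    """
    L = x.shape[-1]
    # Circular convolution of x with the short filter w, computed via the FFT.
    z = np.fft.ifft(np.fft.fft(x) * np.fft.fft(w, n=L)).real
    # The magnitude spectrum is invariant to circular shifts of z (and so of x).
    return np.abs(np.fft.fft(z))

def data_dependent_conv(x, w):
    """Global circular convolution of x with a kernel generated from x itself."""
    k_f = conditioned_kernel_freq(x, w)
    # Pointwise product in the frequency domain equals circular convolution in time.
    return np.fft.ifft(np.fft.fft(x) * k_f).real

# Usage: 4 channels, sequence length 64.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
w = np.array([0.5, 0.3, 0.2])
y = data_dependent_conv(x, w)

# Shifting the input circularly shifts the output by the same amount.
y_shifted = data_dependent_conv(np.roll(x, 5, axis=-1), w)
print(np.allclose(y_shifted, np.roll(y, 5, axis=-1)))
```

Because the kernel here is itself a function of the input, the layer is nonlinear in `x`: scaling the input by 2 scales this sketch's output by 4. A fixed-kernel convolution would scale it by 2.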
