AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation

Anil Kag, Huseyin Coskun, Jierun Chen, Junli Cao, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Jian Ren

TL;DR

This work revisits the key design principles of hybrid architectures and proposes AsCAN, a simple and effective hybrid architecture that combines convolutional and transformer blocks. AsCAN supports a variety of tasks (recognition, segmentation, and class-conditional image generation) and offers a superior trade-off between performance and latency.

Abstract

Neural network architecture design requires making many crucial decisions. A common desideratum is that similar decisions, with minor modifications, can be reused across a variety of tasks and applications. To satisfy this, architectures must provide promising latency and performance trade-offs, support a variety of tasks, scale efficiently with respect to the amounts of data and compute, leverage available data from other tasks, and efficiently support various hardware. To this end, we introduce AsCAN, a hybrid architecture combining both convolutional and transformer blocks. We revisit the key design principles of hybrid architectures and propose a simple and effective asymmetric architecture, in which the distribution of convolutional and transformer blocks is asymmetric: the earlier stages contain more convolutional blocks, and the later stages contain more transformer blocks. AsCAN supports a variety of tasks (recognition, segmentation, and class-conditional image generation) and features a superior trade-off between performance and latency. We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance compared to the most recent public and commercial models. Notably, even without any computation optimization for transformer blocks, our models still yield faster inference speed than existing works featuring efficient attention mechanisms, highlighting the advantages and the value of our approach.
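
To make the asymmetric distribution concrete, the following is a minimal sketch of how per-stage block layouts could be written down, with C for a convolutional block and T for a transformer block. The function name `asymmetric_layout`, the block counts, the convolution fractions, and the choice to place C blocks before T blocks within a stage are all illustrative assumptions, not the configuration reported in the paper.

```python
from typing import List


def asymmetric_layout(blocks_per_stage: List[int],
                      conv_fraction: List[float]) -> List[str]:
    """Return one layout string per stage, e.g. 'CCCT'.

    'C' marks a convolutional block and 'T' a transformer block.
    Placing all C blocks before the T blocks inside a stage is an
    assumption made for this sketch.
    """
    if len(blocks_per_stage) != len(conv_fraction):
        raise ValueError("need one conv fraction per stage")
    layout = []
    for n_blocks, frac in zip(blocks_per_stage, conv_fraction):
        if not 0.0 <= frac <= 1.0:
            raise ValueError("conv fraction must lie in [0, 1]")
        n_conv = round(n_blocks * frac)
        layout.append("C" * n_conv + "T" * (n_blocks - n_conv))
    return layout


# Hypothetical four-stage configuration: the share of convolutional
# blocks shrinks from the first stage to the last.
stages = asymmetric_layout([2, 4, 8, 2], [1.0, 0.75, 0.25, 0.0])
for index, stage in enumerate(stages, start=1):
    print(f"stage {index}: {stage}")
```

A symmetric hybrid would use the same C/T ratio in every stage; here the ratio is a per-stage parameter, which is the property the abstract describes.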

Paper Structure

This paper contains 29 sections, 3 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Example images generated by our efficient text-to-image generation model based on an asymmetric architecture. It generates photo-realistic images while following long prompts.
  • Figure 2: Example AsCAN architectures for Image Classification & Text-to-Image Generation. (a): The architecture for image classification and details of the convolutional (C) and transformer (T) blocks. AsCAN includes a Stem (consisting of convolutional layers) and four stages, followed by pooling and a classifier. (b): The UNet architecture for image generation. The Down blocks (the first three blocks starting from the left) mirror the Up blocks (the first three blocks starting from the right). (c): The details of the C and T blocks used in the UNet. For the T block that performs cross attention between latent image features and the textual embedding, the $Q$ matrix comes from the textual embedding. Note that, compared to image classification, the C and T blocks for image generation only add extra components to incorporate the input time-step and textual embeddings.
  • Figure 3: Top-1 Accuracy vs Inference Latency on ImageNet-1K Classification. We plot the latency, measured as images inferred per second, on a single V100 GPU (left) and A100 GPU (right) with batch size $16$ at $224\times 224$ resolution. The plot compares state-of-the-art models (convolutional, transformer, and hybrid architectures) against the proposed AsCAN architecture. The area of each circle is proportional to the model size. While some models regress between the two hardware platforms (e.g., MaxViT-S vs SMT-B), our model consistently achieves better accuracy vs latency trade-offs. We report additional baselines, along with multiply-add operation counts and different batch sizes, in Appendix Tab. \ref{table:imagenet_results}.
  • Figure 4: Qualitative Comparison against open source and commercial models. We compare our T2I model against generations from different baselines. Existing models often generate images with less photo-realism (either far fewer details or a more cartoonish look), especially PixArt-$\alpha$ and PixArt-$\Sigma$. Further, they frequently miss fine-grained details explicitly requested in the prompts. We highlight these mistakes in red in the input prompt. For instance, in the above generations (ordered A $\to$ F from top to bottom row), baselines exhibit issues such as (A) lack of realism, and miss details such as (B) light blue jeans, (C) white sunglasses, (D) black, orange, and white feathers, (E) grey scarf & back towards camera, and (F) gray knitted hat with dark blue-brown patterns.
  • Figure 5: Image-Text Alignment Study. We perform a user study with $1000$ prompts and ask participants to choose the images with better image-text alignment. The study shows that we outperform SDXL and PixArt-$\alpha$. While our performance is on par with PixArt-$\Sigma$, Tab. \ref{table:treasure_data_fid} shows that we yield more realistic generations.
  • ...and 6 more figures
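
The Figure 3 caption reports inference speed as images inferred per second at a fixed batch size. A minimal sketch of such a measurement is given below; the helper `images_per_second`, the warm-up and iteration counts, and the CPU-only `dummy_model` stand-in are assumptions for illustration and do not reproduce the paper's benchmarking setup.

```python
import time
from typing import Callable


def images_per_second(run_batch: Callable[[int], None],
                      batch_size: int = 16,
                      warmup: int = 3,
                      iters: int = 10) -> float:
    """Time `iters` forward passes and return images processed per second."""
    # Warm-up passes are excluded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        run_batch(batch_size)
    # On a GPU, asynchronous kernels would need a device synchronization
    # before each clock read (e.g. torch.cuda.synchronize() in PyTorch);
    # that step is omitted in this CPU-only sketch.
    start = time.perf_counter()
    for _ in range(iters):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


def dummy_model(batch_size: int) -> None:
    """Hypothetical stand-in for a model forward pass on one batch."""
    sum(i * i for i in range(1000 * batch_size))


throughput = images_per_second(dummy_model, batch_size=16)
print(f"{throughput:.1f} images/s")
```

Dividing the number of images by the elapsed wall-clock time gives throughput; per-image latency under this setup is its reciprocal.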