
QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Wenjun Huang, Tamoghno Das, Suyeon Jang, Mohsen Imani

TL;DR

Deformable transformer inference suffers from irregular memory accesses that limit hardware efficiency. QUILL couples a schedule-aware, DOOQ-based prefetch with a fused MSDeformAttn core to turn sparse sampling into cache-friendly, single-pass computation while preserving model accuracy. The RTL-based accelerator achieves up to 7.29× higher throughput and 47.3× better energy efficiency than a top-tier GPU, and outperforms prior accelerators by up to 9.82× in throughput while staying within 0.9 AP of FP32 accuracy under mixed precision. By converting sparsity into locality and locality into utilization, QUILL delivers end-to-end speedups for Deformable DETR variants and offers a scalable co-design paradigm for sparse attention in vision transformers.

Abstract

Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer, forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection ($W''_m$) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29× higher throughput and 47.3× better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82× in throughput and 2.01-6.07× in energy efficiency. With mixed-precision quantization, accuracy stays within 0.9 AP of FP32 across Deformable and Sparse DETR variants. By converting sparsity into locality, and locality into utilization, QUILL delivers consistent, end-to-end speedups.
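
The idea of ordering queries by spatial proximity and using the look-ahead to prefetch the next region can be illustrated with a small software sketch. This is not the paper's DOOQ algorithm or its hardware schedule: the greedy nearest-neighbour ordering, the fixed square tiling, and the names `dooq_order`, `tile_of`, and `run_schedule` are all assumptions made for illustration. It only shows why a proximity-ordered schedule touches fewer distinct regions in sequence than the natural query order, and where a look-ahead prefetch into an alternate buffer would be issued.

```python
import math

def dooq_order(ref_points, start=0):
    """Greedy nearest-neighbour ordering of query reference points.

    Hypothetical stand-in for distance-based out-of-order querying:
    always visit the unvisited query closest to the current one.
    """
    remaining = set(range(len(ref_points)))
    remaining.discard(start)
    order = [start]
    while remaining:
        cx, cy = ref_points[order[-1]]
        nxt = min(
            remaining,
            key=lambda i: (math.hypot(ref_points[i][0] - cx,
                                      ref_points[i][1] - cy), i),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order

def tile_of(point, tile=0.25):
    """Map a normalized (x, y) point to an assumed square region id."""
    return (int(point[0] // tile), int(point[1] // tile))

def run_schedule(ref_points, order, tile=0.25):
    """Walk a schedule and count region switches.

    Returns (loads, log), where each log entry is
    (query, active_region, region_to_prefetch_or_None).
    The prefetch target models filling the alternate buffer
    while the active buffer is still being consumed.
    """
    loads, active, log = 0, None, []
    for pos, q in enumerate(order):
        region = tile_of(ref_points[q], tile)
        if region != active:
            loads += 1
            active = region
        nxt = None
        if pos + 1 < len(order):
            nxt = tile_of(ref_points[order[pos + 1]], tile)
        log.append((q, active, nxt if nxt != active else None))
    return loads, log

# Toy reference points that alternate between two far-apart corners.
points = [(0.1, 0.1), (0.9, 0.9), (0.12, 0.1), (0.88, 0.9), (0.1, 0.15)]
natural_loads, _ = run_schedule(points, list(range(len(points))))
ordered = dooq_order(points)
ordered_loads, ordered_log = run_schedule(points, ordered)
print(ordered, natural_loads, ordered_loads)
```

On this toy input the natural order switches region on every query, while the proximity order groups the three lower-left queries before moving to the upper-right pair, and the log shows a single prefetch issued one step before the switch.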

Paper Structure

This paper contains 16 sections, 3 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: Landscape of SOTA object detection models.
  • Figure 2: FLOPs vs. latency on RTX 4090 comparing the backbone to the deformable transformer block.
  • Figure 3: GPU performance breakdowns of Deformable DETR and query-pruned models on RTX 4090. (a) Latency across internal blocks. (b) Roofline view of bottlenecks and mitigation levers.
  • Figure 4: MSDeformAttn dataflow linked to Eq. \ref{eq:deform_atten}, with hardware bottlenecks marked at the memory touchpoints: cache misses from scattered samples; low arithmetic intensity (few ops per byte); and bank conflicts around fractional 2$\times$2 fetches.
  • Figure 5: Sparsity amplifies irregular access. (a) Top-$k$ pruning cuts FLOPs but scatters queries, reducing cache hit rate. (b) Sparse encoder (top-$\rho\%$). (c) Decoder using top-$N_d$ tokens. Both induce less cache-friendly patterns.
  • ...and 7 more figures