Cross Resolution Encoding-Decoding For Detection Transformers

Ashish Kumar; Jaesik Park

TL;DR

This paper proposes a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection while having the speed of low-resolution detection.

Abstract

Detection Transformers (DETR) are renowned object detection pipelines; however, computationally efficient multiscale detection using DETR is still challenging. In this paper, we propose a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection while having the speed of low-resolution detection. CRED is based on two modules: the Cross Resolution Attention Module (CRAM) and One Step Multiscale Attention (OSMA). CRAM is designed to transfer the knowledge of the low-resolution encoder output to a high-resolution feature, while OSMA is designed to fuse multiscale features in a single step and produce a feature map of a desired resolution enriched with multiscale information. When used in prominent DETR methods, CRED delivers accuracy similar to the high-resolution DETR counterpart with roughly 50% fewer FLOPs. Specifically, state-of-the-art DN-DETR, when used with CRED (which we call CRED-DETR), becomes 76% faster with ~50% fewer FLOPs than its high-resolution counterpart, which requires 202 G FLOPs, on the MS-COCO benchmark. We plan to release pretrained CRED-DETRs for use by the community. Code: https://github.com/ashishkumar822/CRED-DETR
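To make the idea of transferring low-resolution encoder output to a high-resolution feature more concrete, the following is a minimal sketch of generic single-head cross-attention in which high-resolution tokens act as queries over low-resolution tokens. It is an illustration under stated assumptions, not the authors' CRAM: the function name, the map sizes, the channel width, and the omission of learned projections are all hypothetical choices made here for brevity.

```python
import numpy as np

def cross_resolution_attention(high_res, low_res):
    """Generic single-head cross-attention (hypothetical helper, no learned projections).

    high_res: (N_high, C) array, a flattened high-resolution feature map (queries).
    low_res:  (N_low, C) array, a flattened low-resolution encoder output (keys/values).
    Returns an (N_high, C) array.
    """
    c = high_res.shape[1]
    scores = high_res @ low_res.T / np.sqrt(c)     # (N_high, N_low) similarity scores
    scores -= scores.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over the low-resolution tokens
    return weights @ low_res                       # weighted sum of low-resolution tokens

rng = np.random.default_rng(0)
high = rng.standard_normal((32 * 32, 64))  # assumed 32x32 map with 64 channels
low = rng.standard_normal((16 * 16, 64))   # assumed 16x16 encoder output with 64 channels
out = cross_resolution_attention(high, low)

# Number of query-key pairs scored: cross-resolution vs. self-attention at high resolution.
cross_pairs = high.shape[0] * low.shape[0]
self_pairs = high.shape[0] ** 2
```

In this toy setting the score matrix has N_high x N_low entries rather than N_high x N_high, which is one generic reason attention against a low-resolution map is cheaper than self-attention at full resolution. The actual cost savings reported in the abstract depend on the full architecture and are not reproduced by this sketch.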

Paper Structure (13 sections, 6 figures, 7 tables)

Introduction
Preliminary
Method
Cross Resolution Attention Module (CRAM)
Computation Complexity
Why Could Cross Resolution Attention Transfer Improve Performance?
One Step Multiscale Attention (OSMA)
Configuring CRED for DETRs
Experiments
Main Results
Ablations
Conclusion
Detection Visualizations on MS-COCO Validation Set

Figures (6)

Figure 1: Left: Single-Scale and/or DC DETR. Middle: IMFA DETR [imfa]. Right: CRED DETR (Ours). Multiple arrows between two modules indicate layerwise refinement. Stage-$1$ features are generally not used due to their large resolution and small receptive field.
Figure 2: CRAM: Cross Resolution Attention Module.
Figure 3: Local aggregation of the multiscale features in OSMA for $g_0=1$, which produces $Q \in \mathbb{R}^{N_g \times T \times C}$.
Figure 4: One-step attention and output broadcasting in OSMA. 'One-Step' should not be confused with 'single layer'; instead, it refers to all the multiscale features being attended simultaneously through a $1\times 1$ layer. Output broadcasting infers the shape of the output feature based on the value of $P$.
Figure 5: Convergence plots on the MS-COCO validation set. (a) Despite having $50\%$ fewer FLOPs and $76\%$ higher FPS, CRED converges similarly to the baseline. (b) DETR with a smaller backbone and its DC$\times0.25$ variants in the $12$-epoch setting. Even with the smaller backbone, the CRED-enabled models with and without DC$\times0.25$ have similar accuracy, whereas the gap between the baselines with and without DC$\times0.25$ is noticeable. This strengthens the utility of CRED: the encoder input resolution can be aggressively reduced to save computation while achieving better accuracy.
