Cross Resolution Encoding-Decoding For Detection Transformers

Ashish Kumar, Jaesik Park

TL;DR

This paper proposes a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection while having the speed of low-resolution detection.

Abstract

Detection Transformers (DETR) are renowned object detection pipelines; however, computationally efficient multiscale detection using DETR is still challenging. In this paper, we propose a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection while having the speed of low-resolution detection. CRED is based on two modules: the Cross Resolution Attention Module (CRAM) and One Step Multiscale Attention (OSMA). CRAM is designed to transfer the knowledge of the low-resolution encoder output to a high-resolution feature, while OSMA is designed to fuse multiscale features in a single step and produce a feature map of a desired resolution enriched with multiscale information. When used in prominent DETR methods, CRED delivers accuracy similar to the high-resolution DETR counterpart with roughly 50% fewer FLOPs. Specifically, the state-of-the-art DN-DETR, when used with CRED (called CRED-DETR), becomes 76% faster with ~50% fewer FLOPs than its high-resolution counterpart, which requires 202 GFLOPs, on the MS-COCO benchmark. We plan to release pretrained CRED-DETRs for use by the community. Code: https://github.com/ashishkumar822/CRED-DETR
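
To make the idea of transferring low-resolution encoder output to a high-resolution feature more concrete, the following is a minimal sketch of generic single-head cross-attention in NumPy, where high-resolution tokens act as queries and low-resolution encoder tokens act as keys and values. It is not the paper's CRAM: the function names, the residual connection, the 512x512 input size, the strides of 16 and 32, and the channel width of 32 are all assumptions chosen for illustration. The sketch also counts token pairs to show why attention over a low-resolution map is cheaper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_resolution_attention(high_res, low_res_encoded):
    """Generic single-head cross-attention (hypothetical helper, NOT the paper's CRAM).

    high_res:        (N_high, C) flattened high-resolution tokens, used as queries
    low_res_encoded: (N_low, C)  flattened low-resolution encoder output, used as keys/values
    Returns an (N_high, C) array: high-resolution tokens enriched with encoder context.
    """
    channels = high_res.shape[-1]
    scores = high_res @ low_res_encoded.T / np.sqrt(channels)  # (N_high, N_low)
    weights = softmax(scores, axis=-1)                         # each row sums to 1
    # Residual connection is an assumption of this sketch.
    return high_res + weights @ low_res_encoded


def token_count(height, width, stride):
    """Number of feature-map tokens for an image downsampled by `stride`."""
    return (height // stride) * (width // stride)


# Assumed sizes, for illustration only.
H, W, C = 512, 512, 32
n_high = token_count(H, W, 16)  # 32 * 32 tokens
n_low = token_count(H, W, 32)   # 16 * 16 tokens

rng = np.random.default_rng(0)
high = rng.standard_normal((n_high, C))
low = rng.standard_normal((n_low, C))
out = cross_resolution_attention(high, low)

# Pairwise token interactions, the dominant term in attention cost.
self_attn_pairs_high = n_high ** 2  # self-attention on the high-resolution map
self_attn_pairs_low = n_low ** 2    # self-attention on the low-resolution map
cross_pairs = n_high * n_low        # high-resolution queries over low-resolution keys

print(out.shape, self_attn_pairs_high // self_attn_pairs_low, cross_pairs)
```

Under these assumed sizes, halving the resolution per side cuts the token count by 4x and the self-attention pair count by 16x, which is the general reason an encoder running at low resolution is much cheaper than one running at high resolution.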

Paper Structure (13 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Left: Single-Scale and/or DC DETR. Middle: IMFA DETR. Right: CRED DETR (Ours). Multiple arrows between two modules indicate layerwise refinement. Stage-$1$ features are generally not used due to their large resolution and small receptive field.
  • Figure 2: CRAM: Cross Resolution Attention Module.
  • Figure 3: Local aggregation of the multiscale features in OSMA for $g_0=1$, which produces $Q \in \mathbb{R}^{N_g \times T \times C}$.
  • Figure 4: One-step attention and output broadcasting in OSMA. 'One-Step' should not be confused with 'single layer'; instead, it refers to all the multiscale features being attended simultaneously through a $1\times 1$ layer. Output broadcasting infers the shape of the output feature based on the value of $P$.
  • Figure 5: Convergence plots on the MS-COCO validation set. (a) Despite having $50\%$ fewer FLOPs and $76\%$ higher FPS, CRED converges similarly to the baseline. (b) DETRs with a smaller backbone and their DC$\times0.25$ variants in the $12$-epoch setting. Even with the smaller backbone, the CRED-enabled models with and without DC$\times0.25$ have similar accuracy, whereas the gap between the baselines with and without DC$\times0.25$ is noticeable. This supports the utility of CRED: the encoder input resolution can be reduced aggressively to save computation while retaining better accuracy.
  • ...and 1 more figures