DECO: Unleashing the Potential of ConvNets for Query-based Detection and Segmentation

Xinghao Chen, Siwei Li, Yijing Yang, Yunhe Wang

TL;DR

This work tackles the efficiency and architectural design of query-based detection and segmentation by replacing attention with a convolution-based interaction mechanism called InterConv. The resulting Detection ConvNet (DECO) uses a CNN backbone, a ConvNet encoder, and an InterConv decoder to produce a fixed set of predictions without NMS or positional encodings. On COCO, DECO reaches $40.5\%$ AP at $66$ FPS with a ResNet-18 backbone and $47.8\%$ AP at $34$ FPS with a ResNet-50 backbone, and it runs faster than several transformer-based variants while remaining deployment-friendly. The approach also extends to the Segment Anything task via DECO-TinySAM, demonstrating cross-domain applicability and efficiency on mobile hardware and highlighting ConvNets as a viable alternative for DETR-like architectures.

Abstract

Transformers and their variants have shown great potential for various vision tasks in recent years, including image classification, object detection, and segmentation. Meanwhile, recent studies reveal that, with proper architecture design, convolutional networks (ConvNets) can also achieve performance competitive with transformers. However, no prior method has explored using pure convolution to build a Transformer-style decoder module, which is essential for encoder-decoder architectures such as the Detection Transformer (DETR). To this end, in this paper we explore whether a query-based detection and segmentation framework can be built with ConvNets instead of a sophisticated transformer architecture. We propose a novel mechanism dubbed InterConv to perform interaction between object queries and image features via convolutional layers. Equipped with the proposed InterConv, we build Detection ConvNet (DECO), which is composed of a backbone and a convolutional encoder-decoder architecture. We compare the proposed DECO against prior detectors on the challenging COCO benchmark. Despite its simplicity, DECO achieves competitive performance in terms of detection accuracy and running speed. Specifically, with the ResNet-18 and ResNet-50 backbones, DECO achieves $40.5\%$ and $47.8\%$ AP at $66$ and $34$ FPS, respectively. The proposed method is also evaluated on the segment anything task, demonstrating similar performance and higher efficiency. We hope the proposed method brings another perspective for designing architectures for vision tasks. Code is available at https://github.com/xinghaochen/DECO and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/DECO.

Paper Structure

This paper contains 16 sections, 3 equations, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: Comparison of our proposed Detection ConvNet (DECO) with recent detectors on the COCO val set. Latency is measured on an NVIDIA V100 GPU.
  • Figure 2: The attention-based decoder and our proposed InterConv. We abstract the general architecture of the decoder and divide it into two components: the Self-Interaction Module (SIM) and the Cross-Interaction Module (CIM). In DETR, the SIM and CIM are implemented with multi-head self-attention and cross-attention, respectively, while in our proposed DECO the SIM is a stack of simple depthwise and $1\times 1$ convolutions. We further propose a novel CIM mechanism for DECO that performs interaction between object queries and image features via convolutional layers together with simple upsampling and pooling operations.
  • Figure 3: The overall architecture of DETR (Carion et al., 2020) and our proposed Detection ConvNet (DECO). DECO is a simple yet effective query-based end-to-end object detection framework that shares the favorable attributes of DETR. Moreover, it is built from standard convolutional layers only and does not rely on any sophisticated attention modules.
  • Figure 3: Effect of different fusion methods.
  • Figure 4: Visualizations of box-prompted segment anything for our DECO-TinySAM ($1^{st}$ row) and TinySAM ($2^{nd}$ row). Our method achieves performance quite similar to that of TinySAM.
  • ...and 3 more figures
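
The Figure 2 caption describes the Cross-Interaction Module (CIM) only at a high level: object queries interact with image features through convolution, upsampling, and pooling. The following is a minimal single-channel sketch of one plausible data flow, not the authors' implementation. The nearest-neighbor upsampling, additive fusion, $3\times 3$ kernel, max pooling, and all function names are assumptions made for illustration; the caption does not specify them.

```python
# Toy, single-channel sketch of a convolutional cross-interaction step.
# ASSUMPTIONS (not confirmed by the caption): nearest-neighbor upsampling,
# fusion by addition, a 3x3 kernel, and max pooling. All names are hypothetical.


def upsample_nearest(grid, out_h, out_w):
    """Resize a 2D grid to (out_h, out_w) by nearest-neighbor lookup."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def depthwise_conv3x3(fmap, kernel):
    """Zero-padded 3x3 convolution on one channel (one depthwise filter)."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * fmap[rr][cc]
            out[r][c] = acc
    return out


def max_pool_to(fmap, out_h, out_w):
    """Max-pool a 2D map down to (out_h, out_w); sizes must divide evenly."""
    h, w = len(fmap), len(fmap[0])
    if h % out_h or w % out_w:
        raise ValueError("feature map size must be a multiple of output size")
    bh, bw = h // out_h, w // out_w
    return [
        [
            max(fmap[r * bh + i][c * bw + j] for i in range(bh) for j in range(bw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def cross_interaction(query_grid, image_feat, kernel):
    """Queries (as a small 2D grid) interact with a larger image feature map."""
    h, w = len(image_feat), len(image_feat[0])
    qh, qw = len(query_grid), len(query_grid[0])
    up = upsample_nearest(query_grid, h, w)            # match spatial sizes
    fused = [[up[r][c] + image_feat[r][c] for c in range(w)] for r in range(h)]
    mixed = depthwise_conv3x3(fused, kernel)           # spatial mixing
    return max_pool_to(mixed, qh, qw)                  # back to query grid size
```

In this sketch the queries are laid out as a small 2D grid so that ordinary spatial operations apply to them; a real model would do the same per channel across many channels and would typically follow the step with pointwise ($1\times 1$) convolutions.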