CornerNet-Lite: Efficient Keypoint Based Object Detection

Hei Law, Yun Teng, Olga Russakovsky, Jia Deng

TL;DR

This paper tackles the slow inference of CornerNet, an anchor-free, keypoint-based detector, by introducing CornerNet-Lite, a pair of efficient variants: CornerNet-Saccade (attention-guided, aimed at offline processing) and CornerNet-Squeeze (compact backbone, aimed at real-time detection). CornerNet-Saccade uses multi-scale attention maps to select and crop high-resolution regions instead of processing every pixel, making CornerNet 6.0× faster while raising AP by 1.0% on COCO. CornerNet-Squeeze employs SqueezeNet-inspired fire modules and depthwise separable convolutions to surpass YOLOv3 in both speed and accuracy (34.4% AP at 30 ms versus 33.0% AP at 39 ms on COCO). Ablation studies show that saccades help only when the attention maps are accurate and the network has sufficient capacity, and that combining Squeeze with Saccade is not beneficial under tight budgets. Overall, CornerNet-Lite demonstrates that keypoint-based detection can meet practical efficiency and real-time constraints, expanding its applicability to time-sensitive tasks.

Abstract

Keypoint-based methods are a relatively new paradigm in object detection, eliminating the need for anchor boxes and offering a simplified detection framework. Keypoint-based CornerNet achieves state of the art accuracy among single-stage detectors. However, this accuracy comes at high processing cost. In this work, we tackle the problem of efficient keypoint-based object detection and introduce CornerNet-Lite. CornerNet-Lite is a combination of two efficient variants of CornerNet: CornerNet-Saccade, which uses an attention mechanism to eliminate the need for exhaustively processing all pixels of the image, and CornerNet-Squeeze, which introduces a new compact backbone architecture. Together these two variants address the two critical use cases in efficient object detection: improving efficiency without sacrificing accuracy, and improving accuracy at real-time efficiency. CornerNet-Saccade is suitable for offline processing, improving the efficiency of CornerNet by 6.0x and the AP by 1.0% on COCO. CornerNet-Squeeze is suitable for real-time detection, improving both the efficiency and accuracy of the popular real-time detector YOLOv3 (34.4% AP at 30ms for CornerNet-Squeeze compared to 33.0% AP at 39ms for YOLOv3 on COCO). Together these contributions for the first time reveal the potential of keypoint-based detection to be useful for applications requiring processing efficiency.

Paper Structure

This paper contains 12 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: We introduce CornerNet-Saccade and CornerNet-Squeeze (collectively, CornerNet-Lite), two efficient object detectors based on CornerNet (Law and Deng, 2018), a state-of-the-art keypoint-based object detector. CornerNet-Saccade speeds up the original CornerNet by 6.0x with a 1% increase in AP. CornerNet-Squeeze is faster and more accurate than YOLOv3 (Redmon and Farhadi, 2018), the state-of-the-art real-time detector. All detectors are tested on the same machine with a 1080Ti GPU and an Intel Core i7-7700k CPU.
  • Figure 2: Overview of CornerNet-Saccade. We predict a set of possible object locations from the attention maps and bounding boxes generated on a downsized full image. We zoom into each location and crop a small region around it. Then we detect objects in the top $k$ regions and merge the detections by NMS.
  • Figure 3: Left: Some objects may not be fully covered by a region. The detector may still generate bounding boxes (red dashed line) for those objects. We remove bounding boxes that touch the region boundaries, since they are likely to cover only part of an object. Right: When objects are close to each other, we may generate regions that highly overlap with each other. Processing any one of them is likely to detect the objects in all of the highly overlapping regions. We suppress redundant regions to improve efficiency.
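
The two filtering steps described in Figure 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, the 1-pixel `margin`, the IoU threshold of 0.5, and the greedy NMS-style pass over regions are all assumptions chosen for illustration, since the captions do not specify these details.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_boundary_boxes(boxes: List[Box], region: Box,
                        margin: float = 1.0) -> List[Box]:
    """Figure 3, left: discard detections that touch the crop boundary.

    `margin` (hypothetical) is how close, in pixels, a box edge may be to
    the region edge before the box counts as touching it.
    """
    rx1, ry1, rx2, ry2 = region
    kept = []
    for x1, y1, x2, y2 in boxes:
        touches = (x1 - rx1 <= margin or y1 - ry1 <= margin or
                   rx2 - x2 <= margin or ry2 - y2 <= margin)
        if not touches:
            kept.append((x1, y1, x2, y2))
    return kept


def suppress_regions(regions: List[Box], scores: List[float],
                     iou_thresh: float = 0.5, top_k: int = 30) -> List[int]:
    """Figure 3, right: keep at most `top_k` regions, skipping any region
    that highly overlaps a higher-scoring region already kept.

    Returns indices into `regions`, highest score first.
    """
    order = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou(regions[i], regions[j]) < iou_thresh for j in kept):
            kept.append(i)
        if len(kept) == top_k:
            break
    return kept


if __name__ == "__main__":
    regions = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
    print(suppress_regions(regions, [0.9, 0.8, 0.7]))
    detections = [(10, 10, 50, 50), (0, 20, 40, 60), (60, 60, 100, 90)]
    print(drop_boundary_boxes(detections, (0, 0, 100, 100)))
```

In this toy input the second region overlaps the first heavily and is skipped, and the two detections whose edges lie on the crop boundary are dropped. A real pipeline would then map the surviving boxes back to full-image coordinates and merge them across regions, as Figure 2 describes.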