Deformable Capsules for Object Detection

Rodney Lalonde, Naji Khosravan, Ulas Bagci

TL;DR

This paper tackles the challenge of applying capsule networks to large-scale object detection by introducing DeformCaps, a deformable capsule framework with SplitCaps and SE-Routing. It enables a one-stage, capsule-based detector that leverages deformable sampling and two specialized head types to model object instantiation and class presence efficiently, achieving competitive MS COCO results with fewer false positives. The key contributions are: (1) deformable capsules that relax rigid spatial constraints, (2) SplitCaps to scale capsule representations for many classes, and (3) SE-Routing to compute routing coefficients in a single forward pass. Overall, DeformCaps demonstrates that capsule-based object detection can reach CNN-level performance in a one-stage setting while improving robustness to unusual poses and viewpoints, with potential implications for efficiency and interpretability in vision systems.

Abstract

Capsule networks promise significant benefits over convolutional networks by storing stronger internal representations and routing information based on the agreement between intermediate representations' projections. Despite this, their success has been limited to small-scale classification datasets due to their computationally expensive nature. Though memory efficient, convolutional capsules impose geometric constraints that fundamentally limit the ability of capsules to model the pose/deformation of objects. Further, they do not address the bigger memory concern of class capsules scaling up to larger tasks such as detection or large-scale classification. In this study, we introduce a new family of capsule networks, deformable capsules (DeformCaps), to address a very important problem in computer vision: object detection. We propose two new algorithms associated with our DeformCaps: a novel capsule structure (SplitCaps) and a novel dynamic routing algorithm (SE-Routing). Together they balance computational efficiency with the need to model a large number of objects and classes, a balance that has never been achieved with capsule networks before. We demonstrate that the proposed methods efficiently scale up to create the first-ever capsule network for object detection in the literature. Our proposed architecture is a one-stage detection framework, and it obtains results on MS COCO that are on par with state-of-the-art one-stage CNN-based methods while producing fewer false-positive detections and generalizing to unusual poses/viewpoints of objects.
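
To make the idea of relaxing a rigid sampling grid concrete, the following is a minimal sketch of deformable sampling in general, not the authors' implementation. It gathers a 3x3 neighbourhood from a single-channel feature map, shifting each tap by its own fractional offset and reading the value with bilinear interpolation. The function names `bilinear_sample` and `deformable_gather`, the 3x3 neighbourhood, and the border clamping are all assumptions made for illustration; in a trained network the offsets would be predicted by a learned layer rather than passed in by hand.

```python
import math

def bilinear_sample(grid, y, x):
    """Read a 2D list `grid` at fractional (y, x), clamping to the border."""
    h, w = len(grid), len(grid[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two bracketing rows, then along y.
    top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
    bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def deformable_gather(grid, center, offsets):
    """Gather a 3x3 neighbourhood around `center`.

    `offsets` holds one hypothetical (dy, dx) shift per tap, in row-major
    order; all-zero offsets reproduce the ordinary rigid 3x3 grid.
    """
    cy, cx = center
    taps = [(ty, tx) for ty in (-1, 0, 1) for tx in (-1, 0, 1)]
    return [
        bilinear_sample(grid, cy + ty + oy, cx + tx + ox)
        for (ty, tx), (oy, ox) in zip(taps, offsets)
    ]
```

With all offsets set to `(0, 0)` the gather returns the usual fixed neighbourhood; non-zero offsets let each tap drift toward wherever the relevant evidence lies, which is the sense in which the sampling grid is "deformable".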

Paper Structure

This paper contains 13 sections, 4 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Deformable capsule architecture for object detection.
  • Figure 2: Proposed deformable SplitCaps formulation which, in one-shot, localizes all objects to their centers, determines their classes, and models their instantiation parameters in two parallel parent capsule types. Information is dynamically routed from children to parents by passing three chosen child capsule descriptors through a two-layer Squeeze-and-Excitation (S&E) bottleneck.
  • Figure 3: Qualitative example for CenterNet (Zhou et al., 2019) (leftmost), the proposed DeformCaps (center), and the ground-truth annotations (rightmost) on the MS COCO test-dev dataset. While CenterNet produces slightly higher average precision (AP) values than DeformCaps, it also seems to produce more false positives on average than DeformCaps.
  • Figure 4: CenterNet (Zhou et al., 2019) (leftmost) consistently produces a higher number of false positives than the proposed DeformCaps (center), as compared to the ground-truth annotations (rightmost) on the MS COCO test-dev dataset.
  • Figure 5: Some more qualitative examples showing CenterNet (Zhou et al., 2019) (leftmost) consistently producing a higher number of false positive detections than the proposed DeformCaps (center), as compared to the ground-truth annotations (rightmost) on the MS COCO test-dev dataset.
  • ...and 2 more figures
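
The Figure 2 caption describes routing in which three child capsule descriptors pass through a two-layer Squeeze-and-Excitation bottleneck. The sketch below illustrates that general shape only, under loudly stated assumptions: the three descriptors used here (vector norm, cosine similarity to the mean child, and peak absolute component) are placeholders rather than the paper's actual choices, the weight matrices `w1` and `w2` are hypothetical stand-ins for learned parameters, and the capsule squashing non-linearity is omitted.

```python
import math

def descriptors(child, mean_vec):
    """Three illustrative per-child statistics (assumed, not the paper's)."""
    norm = math.sqrt(sum(c * c for c in child))
    mean_norm = math.sqrt(sum(m * m for m in mean_vec))
    dot = sum(c * m for c, m in zip(child, mean_vec))
    cosine = dot / (norm * mean_norm) if norm > 0 and mean_norm > 0 else 0.0
    peak = max(abs(c) for c in child)
    return [norm, cosine, peak]

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def se_routing(children, w1, w2):
    """Single-pass routing: descriptors -> FC + ReLU -> FC + sigmoid.

    children: list of N child capsule vectors of equal length.
    w1: hypothetical (hidden x 3N) weights; w2: hypothetical (N x hidden).
    Returns one coefficient per child and the coefficient-weighted parent.
    """
    n, dim = len(children), len(children[0])
    mean_vec = [sum(ch[d] for ch in children) / n for d in range(dim)]
    # Squeeze: summarise every child with three scalars.
    z = [s for ch in children for s in descriptors(ch, mean_vec)]
    # Excitation: two-layer bottleneck ending in a sigmoid gate per child.
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    coeffs = [1.0 / (1.0 + math.exp(-o)) for o in matvec(w2, hidden)]
    parent = [
        sum(c * ch[d] for c, ch in zip(coeffs, children)) for d in range(dim)
    ]
    return coeffs, parent
```

The point of the sketch is that the coefficients come out of one forward pass through the bottleneck, with no iterative agreement loop; how the descriptors are chosen and how the weights are trained are left to the paper.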