Hierarchical Object-Centric Learning with Capsule Networks

Riccardo Renzulli

TL;DR

Convolutional neural networks struggle to preserve spatial relationships and to represent objects explicitly. This motivates capsule networks (CapsNets), which encode objects and their parts as capsules and use a routing mechanism to model hierarchical part–whole relationships. The thesis investigates three directions: annealing the number of routing iterations to improve small CapsNets, building efficient CapsNets on pruned backbones, and learning concise part–whole relations through low-entropy routing and pruning (REM). It reports that routing annealing can boost generalization in parameter-constrained models, that pruned backbones reduce memory and computation while maintaining performance, and that low-entropy routing yields more discriminative parse trees with fewer spurious relations. The thesis also applies CapsNets to real-world tasks, including autonomous UAV localization under large appearance changes, quaternion-based rotation prediction on synthetic datasets, and lung nodule segmentation, underscoring the potential of object-centric representations for robustness, interpretability, and scalability across vision tasks.

Abstract

Capsule networks (CapsNets) were introduced to address the limitations of convolutional neural networks by learning object-centric representations that are more robust, pose-aware, and interpretable. They organize neurons into groups called capsules, where each capsule encodes the instantiation parameters of an object or one of its parts. A routing algorithm connects capsules in different layers, thereby capturing hierarchical part-whole relationships in the data. This thesis investigates three key questions about CapsNets to unlock their full potential. First, we explore the effectiveness of the routing algorithm, particularly in small networks. We propose a novel method that anneals the number of routing iterations during training, enhancing performance in architectures with fewer parameters. Second, we investigate methods to extract more effective first-layer capsules, also known as primary capsules. By exploiting pruned backbones, we reduce the number of capsules while retaining high generalization, which lowers CapsNets' memory requirements and computational effort. Third, we explore part-whole relationship learning in CapsNets. We demonstrate that capsules with low-entropy routing can extract more concise and discriminative part-whole relationships than traditional capsule networks, even with reasonable network sizes. Lastly, we showcase how CapsNets can be used in real-world applications, including autonomous localization of unmanned aerial vehicles, quaternion-based rotation prediction in synthetic datasets, and lung nodule segmentation in biomedical imaging. The findings presented in this thesis contribute to a deeper understanding of CapsNets and highlight their potential to address complex computer vision challenges.
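
To make the routing idea in the abstract concrete, the following is a minimal NumPy sketch of routing-by-agreement in the style of the standard dynamic routing algorithm for CapsNets, together with the entropy of the resulting coupling coefficients. It is an illustration under stated assumptions, not the thesis implementation: the function names (`squash`, `dynamic_routing`, `routing_entropy`), the tensor shapes, and the random votes are all hypothetical, and neither the annealing schedule nor the REM procedure is reproduced here.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink each vector's length into [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iters=3):
    """Route votes from lower capsules to upper capsules.

    u_hat: votes with assumed shape (num_lower, num_upper, dim).
    Returns the upper capsules (num_upper, dim) and the coupling
    coefficients (num_lower, num_upper) used to compute them.
    """
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))  # routing logits start uniform
    for _ in range(num_iters):
        c = softmax(b, axis=1)                      # each part spreads its vote over wholes
        s = np.einsum("ij,ijd->jd", c, u_hat)       # weighted sum of votes per whole
        v = squash(s)                               # upper-capsule outputs
        b = b + np.einsum("ijd,jd->ij", u_hat, v)   # reward votes that agree with v
    return v, c

def routing_entropy(c, eps=1e-12):
    """Mean Shannon entropy (in nats) of each lower capsule's coupling distribution."""
    return float(np.mean(-np.sum(c * np.log(c + eps), axis=1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    votes = rng.normal(size=(6, 3, 4))  # 6 parts, 3 wholes, 4-D poses (arbitrary sizes)
    for iters in (1, 3, 5):
        _, coupling = dynamic_routing(votes, num_iters=iters)
        print(iters, routing_entropy(coupling))
```

In this sketch, `num_iters` is the quantity that an annealing schedule would vary during training, and `routing_entropy` is one plausible way to measure how concentrated the part-whole assignments are. With a single iteration the coupling coefficients are uniform, so the entropy sits at its maximum, the logarithm of the number of upper capsules.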

Paper Structure (124 sections, 53 equations, 53 figures, 19 tables, 3 algorithms)

Introduction
Learning with Capsule Networks
The limitations of Convolutional Networks
Coordinate frames
Invariance and Equivariance
Linear Manifold
Routing
Background on Capsule Networks
Notation
What are capsules
Capsule Networks Fundamentals
General Architecture
Dynamic Routing
Drawbacks
Capsule Networks Follow-Ups
...and 109 more sections

Figures (53)

Figure 1: Different arrangements of the same components can produce different objects.
Figure 2: The same object will look different depending upon the coordinate frame imposed (a duck if the front points to the left, a rabbit if the front points to the right).
Figure 3: Example of a scene graph.
Figure 4: Max-pooling over spatial regions produces invariance to translation but not to other transformations, such as rotations. If we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to. Here we show how a CNN can learn rotation invariance thanks to pooling layers applied to many feature detectors.
Figure 5: Lung nodules in CT scans (left) and histopathology slices (right) have translation, rotation, reflection, and scaling symmetries.
...and 48 more figures
