
Masked Capsule Autoencoders

Miles Everett, Mingjun Zhong, Georgios Leontidis

TL;DR

This paper addresses scaling Capsule Networks to realistically sized images by introducing Masked Capsule Autoencoders (MCAE), a self-supervised pretraining framework based on masked image modelling. MCAE flattens capsule feature maps to a 1D sequence, applies patch masking, and uses a learnable masked token with a fully capsule-based decoder during pretraining before finetuning with a class capsule layer. Across multiple datasets, MCAE with masked pretraining achieves state-of-the-art results among Capsule Networks, notably a substantial gain on Imagenette (approximately 9 percentage points over the baseline) and strong performance on Imagewoof, while ConvMixer backbones outperform ViT backbones in this setting. The approach preserves capsule advantages such as viewpoint consistency and generalizes well to novel viewpoints, though the decoder introduces significant pretraining compute. Overall, MCAE demonstrates that self-supervised pretraining is a viable path to scale Capsule Networks to modern, realistically sized image tasks.

Abstract

We propose Masked Capsule Autoencoders (MCAE), the first Capsule Network that utilises pretraining in a modern self-supervised paradigm, specifically the masked image modelling framework. Capsule Networks have emerged as a powerful alternative to Convolutional Neural Networks (CNNs). They have shown favourable properties when compared to Vision Transformers (ViT), but have struggled to learn effectively when presented with more complex data. This has led to Capsule Network models that do not scale to modern tasks. Our proposed MCAE model alleviates this issue by reformulating the Capsule Network to use masked image modelling as a pretraining stage before finetuning in a supervised manner. Across several experiments and ablation studies, we demonstrate that, similarly to CNNs and ViTs, Capsule Networks can also benefit from self-supervised pretraining, paving the way for further advancements in this neural network domain. For instance, by pretraining on the Imagenette dataset, which consists of 10 classes of Imagenet-sized images, we achieve state-of-the-art results for Capsule Networks, demonstrating a 9% improvement compared to our baseline model. Thus, we propose that Capsule Networks benefit from, and should be trained within, a masked image modelling framework, using a novel capsule decoder, to enhance a Capsule Network's performance on realistically sized images.

Paper Structure (21 sections, 5 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Our Masked Capsule Autoencoder architecture. During pretraining we randomly select a number of patches from the original image to be processed. The Capsule Network then creates a representation for each patch. Masked patch capsule representations are re-added before the capsule decoder, where the unmasked capsules can contribute to the masked positions, which are finally decoded by a single linear layer to the original patch dimensions. The pretraining objective is the mean squared error between the reconstructed patches and the target patches. The dog image used is sourced from the Imagewoof validation set (Howard, 2019).
  • Figure 2: A visual representation of how a 2D patch feature map or capsule feature map, with a height and a width, is flattened into a 1D feature map with a length instead. Each location holds the same number of capsule types, each corresponding to a different part or concept in the part-whole parse tree. The dog image used is sourced from the Imagewoof validation set (Howard, 2019).
  • Figure 3: A visual representation of the masking process. An image is split into non-overlapping patches of $N \times N$ pixels. A percentage of the patches (50% in this case) is removed at random, depriving the network of the information in those patches. The remaining patches are then flattened into a 1D sequence, ready to be processed by our encoder. The dog image used is sourced from the Imagewoof validation set (Howard, 2019).
  • Figure 4: A visual representation of how patches are selected for the pretraining loss, the mean squared error (MSE) between reconstructed and target patches. The dog image used is sourced from the Imagewoof validation set (Howard, 2019).
  • Figure 5: A visual depiction of the pretraining and finetuning components. We show how the feature-extracting CNN and capsule encoder are carried over from the pretraining step to the finetuning step. The capsule decoder is discarded after pretraining and replaced with a class capsules layer that maps the capsule encoder network to a classification output.
  • ...and 2 more figures
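
The patch masking and pretraining loss described in the captions for Figures 3 and 4 can be illustrated with a small, standard-library-only sketch. This is not the authors' implementation: the function names `patchify`, `random_mask`, and `masked_mse` are hypothetical, the image is a toy 4 × 4 grid of integers, and computing the loss only over the masked patches is an assumption borrowed from common masked-image-modelling practice (the captions do not state which patches are selected).

```python
import random


def patchify(image, n):
    """Split an H x W image (list of rows) into non-overlapping n x n patches.

    Returns a 1D, row-major sequence of flattened patches.
    """
    h, w = len(image), len(image[0])
    if h % n or w % n:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, h, n):
        for left in range(0, w, n):
            patches.append(
                [image[top + r][left + c] for r in range(n) for c in range(n)]
            )
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Randomly choose which patch indices are kept and which are masked."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


def masked_mse(reconstructed, target, masked):
    """Mean squared error over the pixels of the masked patches only.

    Restricting the loss to masked patches is an assumption of this sketch.
    """
    if not masked:
        raise ValueError("no masked patches to compute a loss over")
    squared = [
        (reconstructed[i][j] - target[i][j]) ** 2
        for i in masked
        for j in range(len(target[i]))
    ]
    return sum(squared) / len(squared)


rng = random.Random(0)  # fixed seed so the toy example is repeatable
image = [[4 * r + c for c in range(4)] for r in range(4)]  # toy 4 x 4 "image"
patches = patchify(image, 2)  # four 2 x 2 patches as a 1D sequence
kept, masked = random_mask(len(patches), 0.5, rng)  # 50% masking, as in Figure 3
visible = [patches[i] for i in kept]  # the only patches an encoder would see

# Stand-in for a decoder output: every pixel is off by exactly one.
reconstructed = [[v + 1 for v in p] for p in patches]
loss = masked_mse(reconstructed, patches, masked)
print(len(visible), "visible patches; loss on masked patches:", loss)
```

In a real model the `visible` sequence would be passed through the capsule encoder, and a learnable masked token would be re-inserted at the `masked` positions before decoding; the stand-in reconstruction here only exercises the loss.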