Masked Capsule Autoencoders
Miles Everett, Mingjun Zhong, Georgios Leontidis
TL;DR
This paper addresses scaling Capsule Networks to realistically sized images by introducing Masked Capsule Autoencoders (MCAE), a self-supervised pretraining framework based on masked image modelling. MCAE flattens capsule feature maps to a 1D sequence, applies patch masking, and uses a learnable masked token with a fully capsule-based decoder during pretraining before finetuning with a class capsule layer. Across multiple datasets, MCAE with masked pretraining achieves state-of-the-art results among Capsule Networks, notably a substantial gain on Imagenette (approximately 9 percentage points over the baseline) and strong performance on Imagewoof, while ConvMixer backbones outperform ViT backbones in this setting. The approach preserves capsule advantages such as viewpoint consistency and generalizes well to novel viewpoints, though the decoder introduces significant pretraining compute. Overall, MCAE demonstrates that self-supervised pretraining is a viable path to scale Capsule Networks to modern, realistically sized image tasks.
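The flatten-and-mask step mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `flatten_and_mask`, the `(height, width, capsules, pose_dim)` layout, the 50% mask ratio, and the zero-valued stand-in for the learnable mask token are all assumptions made for illustration. In a real model the token would be a trained parameter, and the encoder and capsule decoder would operate on the resulting sequence.

```python
import numpy as np

def flatten_and_mask(feature_map, mask_token, mask_ratio=0.5, rng=None):
    """Flatten a 2D grid of capsules to a 1D sequence and mask random positions.

    feature_map: array of shape (height, width, capsules, pose_dim) (assumed layout).
    mask_token:  array of shape (capsules, pose_dim); stands in for a learnable token.
    Returns the masked sequence and a boolean mask (True = position was masked).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c, d = feature_map.shape

    # Flatten the spatial grid into a sequence of h * w positions.
    seq = feature_map.reshape(h * w, c, d)

    # Pick a random subset of positions to mask.
    num_masked = int(round(mask_ratio * h * w))
    masked_idx = rng.permutation(h * w)[:num_masked]
    mask = np.zeros(h * w, dtype=bool)
    mask[masked_idx] = True

    # Replace every masked position with the shared mask token.
    masked_seq = seq.copy()
    masked_seq[mask] = mask_token
    return masked_seq, mask

# Example: a 4x4 grid, 8 capsules per position, 16-dimensional poses.
rng = np.random.default_rng(0)
fmap = rng.normal(size=(4, 4, 8, 16))
token = np.zeros((8, 16))  # placeholder for a learnable parameter
masked_seq, mask = flatten_and_mask(fmap, token, mask_ratio=0.5, rng=rng)
```

During pretraining, a reconstruction loss would typically be computed only at the positions where `mask` is true, as is common in masked image modelling. The exact loss used by MCAE is not specified in this section.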
Abstract
We propose Masked Capsule Autoencoders (MCAE), the first Capsule Network that utilises pretraining in a modern self-supervised paradigm, specifically the masked image modelling framework. Capsule Networks have emerged as a powerful alternative to Convolutional Neural Networks (CNNs). They have shown favourable properties when compared to Vision Transformers (ViT), but have struggled to learn effectively when presented with more complex data. This has led to Capsule Network models that do not scale to modern tasks. Our proposed MCAE model alleviates this issue by reformulating the Capsule Network to use masked image modelling as a pretraining stage before finetuning in a supervised manner. Across several experiments and ablation studies we demonstrate that, like CNNs and ViTs, Capsule Networks can also benefit from self-supervised pretraining, paving the way for further advancements in this neural network domain. For instance, by pretraining on the Imagenette dataset, which consists of 10 classes of Imagenet-sized images, we achieve state-of-the-art results for Capsule Networks, demonstrating a 9% improvement compared to our baseline model. Thus, we propose that Capsule Networks benefit from and should be trained within a masked image modelling framework, using a novel capsule decoder, to enhance a Capsule Network's performance on realistically sized images.
