An Empirical Study on Improving SimCLR's Nonlinear Projection Head using Pretrained Autoencoder Embeddings
Andreas Schliebitz, Heiko Tapken, Martin Atzmueller
TL;DR
This study seeks to improve SimCLR's nonlinear projection head by replacing the projector's input layer with a pretrained, frozen autoencoder embedding layer while systematically varying projector width and activation function. The approach yields consistent accuracy gains (up to 2.9% on STL-10) and reduces the dimensionality of the projection space, with sigmoid and tanh activations often outperforming ReLU. Across five datasets and 144 models, frozen embeddings provide stability and improve average Acc@1 by roughly 1.7–1.8%, suggesting that unsupervised pretraining of the projection head can offer meaningful priors for contrastive learning. The work demonstrates a practical, architecture-lean way to boost performance and points to future extensions with more advanced autoencoder architectures and attention-based projection heads.
Abstract
This paper focuses on improving the effectiveness of the standard 2-layer MLP projection head featured in the SimCLR framework through the use of pretrained autoencoder embeddings. Given a contrastive learning task with a largely unlabeled image classification dataset, we first train a shallow autoencoder architecture and extract its compressed representations contained in the encoder's embedding layer. After freezing the weights within this pretrained layer, we use it as a drop-in replacement for the input layer of SimCLR's default projector. We also apply further architectural changes to the projector by decreasing its width and changing its activation function. The different projection heads are then used to contrastively train and evaluate a feature extractor following the SimCLR protocol. Our experiments indicate that using a pretrained autoencoder embedding in the projector can not only increase classification accuracy by up to 2.9%, or by 1.7% on average, but can also significantly decrease the dimensionality of the projection space. Our results also suggest that using the sigmoid and tanh activation functions within the projector can outperform ReLU in terms of peak and average classification accuracy. All experiments involving our pretrained projectors are conducted with frozen embeddings, because our test results indicate an advantage over their non-frozen counterparts.
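To make the projector design described above concrete, the following is a minimal, framework-free sketch rather than the authors' implementation. It assumes that the pretrained embedding layer is a single affine map whose output size sets the projector width, that the chosen activation is applied to its output, and that only the output layer is trained. The class name `FrozenEmbeddingProjector`, all shapes, and the random stand-in for the pretrained weights are hypothetical; a real experiment would use a deep learning framework and weights taken from a trained autoencoder.

```python
import math
import random

# Activation functions compared in the abstract (ReLU, sigmoid, tanh).
ACTIVATIONS = {
    "relu": lambda v: max(0.0, v),
    # Sigmoid written via tanh, which avoids overflow for large |v|.
    "sigmoid": lambda v: 0.5 * (1.0 + math.tanh(0.5 * v)),
    "tanh": math.tanh,
}


def linear(x, weights, bias):
    """Compute W x + b for one input vector x (weights is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class FrozenEmbeddingProjector:
    """Two-layer projector whose input layer is a frozen, pretrained embedding.

    Hypothetical illustration: name, shapes, and initialisation are assumptions.
    """

    def __init__(self, embed_weights, embed_bias, out_dim,
                 activation="sigmoid", seed=0):
        rng = random.Random(seed)
        # Input layer: copied from the autoencoder's embedding layer, never updated.
        self.embed_weights = [list(row) for row in embed_weights]
        self.embed_bias = list(embed_bias)
        width = len(self.embed_weights)  # assumed: projector width = embedding size
        # Output layer: freshly initialised; this is the part trained contrastively.
        self.out_weights = [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                            for _ in range(out_dim)]
        self.out_bias = [0.0] * out_dim
        self.act = ACTIVATIONS[activation]

    def forward(self, features):
        """Map feature-extractor outputs into the projection space."""
        hidden = [self.act(v)
                  for v in linear(features, self.embed_weights, self.embed_bias)]
        return linear(hidden, self.out_weights, self.out_bias)

    def trainable_parameters(self):
        """Expose only the output layer to an optimiser; the embedding stays frozen."""
        return {"out_weights": self.out_weights, "out_bias": self.out_bias}


# Random stand-in for a pretrained embedding layer mapping 8 features to 4 dimensions.
rng = random.Random(42)
pretrained_w = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]
pretrained_b = [0.0] * 4

projector = FrozenEmbeddingProjector(pretrained_w, pretrained_b,
                                     out_dim=3, activation="tanh")
z = projector.forward([0.5] * 8)
print(len(z))  # 3: dimensionality of the projection space in this toy setup
```

Switching the `activation` argument between `"relu"`, `"sigmoid"`, and `"tanh"`, or passing a smaller embedding matrix, mirrors the width and activation variations mentioned in the abstract, under the assumptions stated above.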
