Improve Contrastive Clustering Performance by Multiple Fusing-Augmenting ViT Blocks
Cheng Wang, Shuisheng Zhou, Fengjiao Peng, Jin Sheng, Feng Ye, Yinli Dong
TL;DR
MFAVBs-CC addresses the limited exploitation of positive-pair complementarity in contrastive clustering. It explicitly fuses intermediate ViT features from two augmented views through multiple fusing-augmenting ViT blocks, each of which fuses and then splits the features, and it uses CLIP-derived semantic priors as a multimodal anchor to guide attention. Instance- and cluster-level objectives are optimized end to end. Empirical results across seven public datasets, including remote sensing data, show consistent improvements over state-of-the-art ViT-based contrastive clustering methods, with notable gains in ACC, NMI, and ARI. The work demonstrates that explicit fusion of intermediate features and multimodal priors can enhance discriminability and clustering quality while remaining adaptable to different backbones and scalable with additional fusion blocks.
Abstract
In image clustering, widely used contrastive learning networks improve clustering performance by maximizing the similarity between positive pairs and the dissimilarity between negative pairs of the inputs. In existing contrastive learning networks, the two encoders often interact only implicitly, through parameter sharing or momentum updating, and so may not fully exploit the complementarity and similarity of the positive pairs when extracting clustering features from the input data. To explicitly fuse the learned features of positive pairs, we design novel multiple fusing-augmenting ViT blocks (MFAVBs), building on the strong feature learning ability of Vision Transformers (ViT). First, two preprocessed augmentations are fed as a positive pair into two shared-weight ViTs, and their output features are fused and fed into a larger ViT. Second, the learned features are split into a new pair of augmented positive samples and passed to the next FAVB, so that fusion and augmentation are repeated multiple times across the MFAVBs. Finally, the learned features are projected into both instance-level and clustering-level spaces to calculate the cross-entropy loss, and the parameters are updated by backpropagation to complete the training process. To further enhance the model's ability to distinguish between similar images, the input to the proposed network consists of the preprocessed augmentations together with features extracted from the CLIP pretrained model. Our experiments on seven public datasets demonstrate that MFAVBs, serving as the backbone for contrastive clustering, outperform state-of-the-art techniques in terms of clustering performance.
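The fuse-then-split data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: `toy_block` is a stand-in for a real ViT block, the names `favb` and `nt_xent` are hypothetical, fusion is assumed to be concatenation along the token axis, and the instance-level loss is assumed to take the common NT-Xent form (a cross-entropy over pairwise similarities). The cluster-level projection and the CLIP features are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_block(tokens, weight):
    """Stand-in for a ViT block (assumption): softmax self-attention plus a residual."""
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return tokens + np.tanh(attn @ tokens @ weight)

def favb(view_a, view_b, w_shared, w_fused):
    """One fusing-augmenting step: shared-weight encode, fuse, encode jointly, split."""
    n_tokens = view_a.shape[1]
    h_a = toy_block(view_a, w_shared)  # both views use the same weights
    h_b = toy_block(view_b, w_shared)
    fused = np.concatenate([h_a, h_b], axis=1)  # assumption: fuse along the token axis
    fused = toy_block(fused, w_fused)  # joint block sees twice as many tokens
    return fused[:, :n_tokens], fused[:, n_tokens:]  # split into a new positive pair

def nt_xent(z_a, z_b, temperature=0.5):
    """Instance-level contrastive loss as cross-entropy over cosine similarities."""
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own positive or negative
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    sim -= sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()

# Two noisy copies of the same token embeddings play the role of the augmented views.
batch, n_tokens, dim = 4, 5, 8
view_a = rng.normal(size=(batch, n_tokens, dim))
view_b = view_a + 0.1 * rng.normal(size=(batch, n_tokens, dim))

# Stack three fusing-augmenting steps, each with its own pair of weight matrices.
weights = [
    (rng.normal(scale=0.1, size=(dim, dim)), rng.normal(scale=0.1, size=(dim, dim)))
    for _ in range(3)
]
for w_shared, w_fused in weights:
    view_a, view_b = favb(view_a, view_b, w_shared, w_fused)

# Mean-pool the tokens to get one embedding per sample, then score the pair.
z_a, z_b = view_a.mean(axis=1), view_b.mean(axis=1)
loss = nt_xent(z_a, z_b)
```

In this sketch the two views exchange information inside every joint block, because attention runs over the concatenated token sequence, rather than interacting only through shared parameters. That is the property the abstract attributes to explicit fusion.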
