Multi-level Cross-modal Alignment for Image Clustering

Liping Qiu, Qin Zhang, Xiaojun Chen, Shaotian Cai

TL;DR

This work tackles unsupervised image clustering by addressing erroneous image-text alignments in cross-modal pretraining. It introduces Multi-level Cross-modal Alignment (MCA), which first constructs a compact semantic space from WordNet using hierarchy-aware filtering, then jointly optimizes image and text embeddings under three alignment levels: instance, prototype, and semantic. The approach is backed by theoretical results showing sublinear convergence and a bounded clustering risk, and is empirically validated on five benchmarks where MCA consistently outperforms strong baselines. The method offers a principled way to fix cross-modal misalignments and improve clustering quality, with potential impact on scalable, label-free image organization in diverse domains.

Abstract

Recently, cross-modal pretraining models have been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pretraining model can produce poor-quality pseudo-labels and degrade clustering performance. To solve this issue, we propose a novel Multi-level Cross-modal Alignment method to improve the alignments in a cross-modal pretraining model for downstream tasks, by building a smaller but better semantic space and aligning the images and texts at three levels, i.e., instance-level, prototype-level, and semantic-level. Theoretical results show that our proposed method converges, and suggest effective means to reduce the expected clustering risk of our method. Experimental results on five benchmark datasets clearly show the superiority of our new method.
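To make the pseudo-labeling step concrete, the sketch below assigns each image the noun whose text embedding is most similar to it under cosine similarity. This is only an illustration of nearest-text pseudo-labeling in general, not the authors' implementation: the function names (`cosine`, `nearest_nouns`), the three-noun vocabulary, and the toy 3-dimensional vectors are all assumptions standing in for real image and text embeddings from a cross-modal model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_nouns(image_embs, noun_embs, nouns):
    """For each image embedding, return the noun with the most similar embedding."""
    labels = []
    for img in image_embs:
        sims = [cosine(img, txt) for txt in noun_embs]
        best = max(range(len(nouns)), key=sims.__getitem__)
        labels.append(nouns[best])
    return labels

# Toy vectors standing in for real image/text embeddings (hypothetical values).
nouns = ["cat", "truck", "ship"]
noun_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8], [0.4, 0.5, 0.1]]

pseudo_labels = nearest_nouns(image_embs, noun_embs, nouns)
print(pseudo_labels)
```

The third toy image sits close to two nouns at once, which is the kind of ambiguous case where a nearest-text assignment can yield an erroneous pseudo-label.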

Paper Structure (25 sections, 7 theorems, 48 equations, 15 figures, 5 tables)

Key Result

Theorem 5

Suppose that $g(\mathcal{S};\varphi)$ is twice differentiable with bounded gradients and Hessians, and $\mathcal{L}(g(\mathcal{S};\varphi))$ has $L$-Lipschitz continuous gradient. Suppose that the learning rate $\eta_{\varphi}$ satisfies $\eta_{\varphi}=\min \{1, \frac{C}{\sqrt{T}}\}$ for some $C>0$,
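The step-size condition in this hypothesis is easy to evaluate numerically. The sketch below is a minimal illustration, assuming $T$ denotes the total number of training iterations (the excerpt above does not define it); the helper name `step_size` and the value `C=2.0` are hypothetical choices made only for the example.

```python
import math

def step_size(T, C):
    """Learning rate eta = min(1, C / sqrt(T)) for a horizon of T iterations."""
    if T <= 0 or C <= 0:
        raise ValueError("T and C must be positive")
    return min(1.0, C / math.sqrt(T))

# Evaluate the schedule for a few horizons with an arbitrary constant C.
etas = {T: step_size(T, C=2.0) for T in (1, 4, 100, 10000)}
print(etas)
```

For short horizons the rate is capped at 1, and for longer ones it shrinks in proportion to $1/\sqrt{T}$.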

Figures (15)

  • Figure 1: The nearest noun (selected from WordNet [miller1995wordnet]) for the images in STL10, where the image and text embeddings are obtained via CLIP [zhou2021learning]. The green words correspond to correct alignments, while the red words indicate incorrect alignments.
  • Figure 2: The framework of MCA consists of three parts: (1) semantic space construction, (2) image consistency learning, and (3) multi-level cross-modal alignment. The thickness of the lines in adaptive instance-level alignment reflects the magnitude of the attention scores.
  • Figure 3: Evolution of pseudo-label accuracy over epochs on Cifar100-20 and ImageNet-Dogs, averaged over 10 runs.
  • Figure 4: Example of pseudo-label generation in MCA. The words below the images are ground-truth labels and the words on the right side of the images are neighboring labels; red indicates irrelevant texts. The blue block in the semantic probability indicates the class to which the word on its left is assigned (the one with the largest probability).
  • Figure 5: Sensitivity analysis of $k_S$ and $k_p$.
  • ...and 10 more figures

Theorems & Definitions (13)

  • Theorem 5
  • Theorem 6
  • Proof 1
  • Lemma 7
  • Proof 2
  • Lemma 8
  • Proof 3
  • Lemma 9
  • Proof 4
  • Lemma 10
  • ...and 3 more