QClusformer: A Quantum Transformer-based Framework for Unsupervised Visual Clustering

Xuan-Bac Nguyen; Hoang-Quan Nguyen; Samuel Yen-Chi Chen; Samee U. Khan; Hugh Churchill; Khoa Luu

TL;DR

This work tackles the heavy computation involved in unsupervised visual clustering on large unlabeled datasets by introducing QClusformer, a quantum transformer-based framework that employs parameterized quantum circuits for self-attention and quantum feature encoding. It presents a complete end-to-end pipeline including amplitude encoding, cosine similarity-based sequence representation, and a clustering-oriented loss to identify hard samples within clusters. Empirical results on MS-Celeb-1M show clear improvements over classical baselines, while DeepFashion results remain competitive, demonstrating the practical viability of quantum-assisted clustering for large-scale vision tasks. The study highlights the potential of quantum transformer architectures to enhance unsupervised visual clustering and motivates further exploration of quantum resources in vision analytics.

Abstract

Unsupervised vision clustering, a cornerstone of computer vision, has been studied for decades and has yielded significant outcomes across numerous vision tasks. However, these algorithms incur substantial computational costs when confronted with vast amounts of unlabeled data. Quantum computing, meanwhile, holds promise for expediting unsupervised algorithms on large-scale databases. In this study, we introduce QClusformer, a pioneering Transformer-based framework that leverages quantum machines to tackle unsupervised vision clustering. Specifically, we design the Transformer architecture, including the self-attention module and transformer blocks, from a quantum perspective to enable execution on quantum hardware. Building on this design, QClusformer is a Transformer variant tailored for unsupervised vision clustering tasks. By integrating these elements into an end-to-end framework, QClusformer consistently outperforms previous methods running on classical computers. Empirical evaluations across diverse benchmarks, including MS-Celeb-1M and DeepFashion, underscore the superior performance of QClusformer compared to state-of-the-art methods.

Paper Structure

This paper contains 19 sections, 10 equations, 4 figures, and 2 tables.

Introduction
Background
Parameterized Quantum Circuit
Visual Clustering
Transformer
Our Proposed Approach
Motivations
Quantum State Feature Encoding
Quantum Transformer
Implementation
Visual Cluster Dataset
Cosine Similarity Encoding
Objective and Loss Functions
Experimental Results
Evaluation Metrics
...and 4 more sections

Figures (4)

Figure 1: An overview of the Quantum Transformer for Clustering framework. Given classical data, i.e., images, feature vectors are extracted from the samples via a classical deep learning model. Then, a k-nearest neighbor algorithm is applied to cluster the samples. To automatically select the correct samples in each cluster, we propose a novel Quantum Clustering Transformer (QClusformer) that evaluates the correlation between feature vectors.
Figure 2: A framework of the self-attention module on classical data.
Figure 3: The Quantum Self-attention Module. Given the $k$ classical feature vectors of a cluster, each of dimension $D$, we encode them into $k$ quantum states. Each quantum state uses $n = \lceil\log_2(D)\rceil$ qubits to hold the information of its classical feature vector. After being transformed by a Parameterized Quantum Circuit, each quantum state is measured to obtain the query, key, and value for self-attention.
Figure 4: Samples from the MS-Celeb-1M and DeepFashion datasets. Each row represents one subject. The first image in each row is the center of a cluster, and the subsequent images are its nearest neighbors, identified by the k-NN algorithm using quantum features. Images bordered in red belong to a different class than the first image in the row, whereas those bordered in green share its class. Best viewed in color.
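
The qubit count in the Figure 3 caption can be made concrete with a small classical simulation of amplitude encoding. This is only an illustrative sketch, not the authors' implementation: the function name `amplitude_encode` is hypothetical, and the zero-padding to length $2^n$ and the L2 normalization are assumptions taken from the usual textbook definition of amplitude encoding rather than from the paper.

```python
import numpy as np


def amplitude_encode(x):
    """Return (amplitudes, n_qubits) for a real feature vector x.

    Hypothetical helper: simulates classically the amplitudes an
    n-qubit state would carry, with n = ceil(log2(D)).
    """
    x = np.asarray(x, dtype=float)
    if x.ndim != 1 or x.size == 0:
        raise ValueError("expected a non-empty 1-D feature vector")

    # (D - 1).bit_length() equals ceil(log2(D)) for D >= 2, computed with
    # exact integer arithmetic; at least one qubit is used when D == 1.
    n_qubits = max(1, (x.size - 1).bit_length())

    # Assumption: pad with zeros up to the 2**n amplitudes of the register.
    padded = np.zeros(2 ** n_qubits)
    padded[: x.size] = x

    # Assumption: L2-normalize so the squared amplitudes sum to 1.
    norm = np.linalg.norm(padded)
    if norm == 0.0:
        raise ValueError("cannot encode the all-zero vector")
    return padded / norm, n_qubits


if __name__ == "__main__":
    state, n = amplitude_encode([3.0, 4.0, 0.0, 0.0, 0.0])
    print(n, state.shape, float(np.sum(state ** 2)))
```

Under these assumptions, a feature vector of dimension $D = 256$ fits in 8 qubits, and the squared amplitudes give the measurement probabilities of the computational basis states.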
