Agent-Centric Personalized Multiple Clustering with Multi-Modal LLMs
Ziye Chen, Yiqun Duan, Riheng Zhu, Zhenbang Sun, Mingming Gong
TL;DR
The paper addresses the generation of diverse, user-aligned clustering partitions by introducing an agent-centric framework in which multi-modal LLMs (MLLMs) act as autonomous agents that traverse a user-interest-biased relational graph. It builds a sparse graph over high-quality embeddings using a learned edge-weighting scheme $w(u,v)=\sigma(\beta \mathcal{H}(u)^{\top}\mathcal{H}(v))$ and employs a two-stage process: (i) MLLM-driven graph construction with LoRA fine-tuning and GPT-4-based pseudo supervision, and (ii) agent-based traversal and membership assessment to form clusters within each connected component, followed by a global merge of redundant clusters. The approach achieves state-of-the-art clustering performance across five benchmarks, including NMI of $0.9667$ on Card Order and $0.9481$ on Card Suits, and perfect NMI/RI on Fruit color and species clustering, while graph sparsification reduces traversal cost. Overall, the method provides a scalable, interpretable framework for personalized clustering that aligns closely with user-defined criteria and shows strong empirical gains over CLIP-based embeddings and traditional clustering baselines. The work highlights the practical potential of integrating reasoning-capable MLLMs into clustering workflows for better customization and explainability.
Abstract
Personalized multiple clustering aims to generate diverse partitions of a dataset based on different user-specific aspects, rather than a single clustering. It has recently drawn research interest because it accommodates varying user preferences. Recent approaches mainly use CLIP embeddings with proxy learning to extract representations biased toward user clustering preferences. However, CLIP focuses on coarse image-text alignment and lacks a deep contextual understanding of user interests. To overcome these limitations, we propose an agent-centric personalized clustering framework that leverages multi-modal large language models (MLLMs) as agents that traverse a relational graph to search for clusters based on user interests. Owing to the advanced reasoning capabilities of MLLMs, the resulting clusters align more closely with user-defined criteria than those obtained from CLIP-based representations. To reduce computational overhead, we shorten the agents' traversal paths by constructing the relational graph from user-interest-biased embeddings extracted by MLLMs. A large number of weakly connected edges can then be filtered out based on embedding similarity, which makes the agents' traversal search efficient. Experimental results show that the proposed method achieves NMI scores of 0.9667 and 0.9481 on the Card Order and Card Suits benchmarks, respectively, improving on the previous state of the art by over 140%.
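The graph sparsification step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy embeddings, the values of `beta` and `tau`, the strict-threshold filtering rule, and the function names are all assumptions chosen for illustration. It scores each pair with $\sigma(\beta\, h_u^{\top} h_v)$, drops weakly connected edges, and groups the remaining nodes into connected components, which is where an agent would then run its traversal and membership checks.

```python
import math

def edge_weight(h_u, h_v, beta=1.0):
    """Edge weight sigmoid(beta * <h_u, h_v>) between two embeddings."""
    dot = sum(a * b for a, b in zip(h_u, h_v))
    return 1.0 / (1.0 + math.exp(-beta * dot))

def build_sparse_graph(embeddings, beta=1.0, tau=0.7):
    """Keep only edges whose weight exceeds tau (an assumed threshold)."""
    n = len(embeddings)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            w = edge_weight(embeddings[u], embeddings[v], beta)
            if w > tau:
                edges.append((u, v, w))
    return edges

def connected_components(n, edges):
    """Group nodes with union-find; each group is one region to traverse."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, _ in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[rv] = ru

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

# Toy embeddings (hypothetical): two pairs of near-duplicates.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = build_sparse_graph(embeddings, beta=4.0, tau=0.9)
components = connected_components(len(embeddings), edges)
print(components)  # [[0, 1], [2, 3]]
```

With these toy values only two of the six candidate edges survive filtering, so a traversal confined to each component inspects far fewer pairs than one over the full graph.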
