Agent-Centric Personalized Multiple Clustering with Multi-Modal LLMs

Ziye Chen, Yiqun Duan, Riheng Zhu, Zhenbang Sun, Mingming Gong

TL;DR

The paper tackles the problem of generating diverse, user-aligned clustering partitions by introducing an agent-centric framework that uses multi-modal LLMs (MLLMs) as autonomous agents to traverse a user-interest-biased relational graph. It builds a sparse relational graph from MLLM embeddings, weighting edges with a learned scheme $w(u,v)=\sigma(\beta \mathcal{H}(u)^{\top}\mathcal{H}(v))$, and employs a two-stage process: (i) MLLM-driven graph construction with LoRA fine-tuning and GPT-4-based pseudo supervision, and (ii) agent-based traversal and membership assessment to form clusters per connected component, followed by a global merge of redundant clusters. The approach achieves state-of-the-art clustering performance across five benchmarks, with notable results such as an NMI of $0.9667$ on Card Order and $0.9481$ on Card Suits, and perfect NMI/RI on both the color and species clusterings of Fruit, while reducing traversal costs via graph sparsification. Overall, the method provides a scalable, interpretable framework for personalized clustering that aligns closely with user-defined criteria and demonstrates strong empirical gains over CLIP-based embeddings and traditional clustering baselines. The work highlights the practical potential of integrating reasoning-capable MLLMs into clustering workflows for enhanced customization and explainability.
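To make the edge-weighting formula concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that $\mathcal{H}(u)$ is the embedding vector of node $u$, that $\sigma$ is the logistic sigmoid, and that $\beta$ is a scalar scale; the summary does not define these symbols, so this reading is an assumption. The threshold value, the toy embeddings, and all function names are hypothetical.

```python
import math

def edge_weight(hu, hv, beta=1.0):
    """w(u, v) = sigmoid(beta * <hu, hv>); assumes sigma is the logistic sigmoid."""
    dot = sum(a * b for a, b in zip(hu, hv))
    return 1.0 / (1.0 + math.exp(-beta * dot))

def build_sparse_graph(H, beta, threshold):
    """Keep only undirected edges whose weight exceeds a (hypothetical) threshold."""
    n = len(H)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if edge_weight(H[u], H[v], beta) > threshold:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def connected_components(adj):
    """Return the connected components of the graph using an iterative DFS."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(comp))
    return components

# Toy embeddings (made up for illustration): two groups pointing in opposite directions.
H = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]]
adj = build_sparse_graph(H, beta=4.0, threshold=0.5)
components = connected_components(adj)
print(components)  # [[0, 1], [2, 3]]
```

In this toy case, pairs with a positive inner product get a weight above 0.5 and survive, while pairs with a negative inner product are dropped. The graph therefore splits into two components that could be searched separately.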

Abstract

Personalized multiple clustering aims to generate diverse partitions of a dataset based on different user-specific aspects, rather than a single clustering. It has recently drawn research interest because it accommodates varying user preferences. Recent approaches primarily use CLIP embeddings with proxy learning to extract representations biased toward user clustering preferences. However, CLIP focuses mainly on coarse image-text alignment and lacks a deep contextual understanding of user interests. To overcome these limitations, we propose an agent-centric personalized clustering framework that leverages multi-modal large language models (MLLMs) as agents that traverse a relational graph to search for clusters based on user interests. Because of the advanced reasoning capabilities of MLLMs, the obtained clusters align more closely with user-defined criteria than those obtained from CLIP-based representations. To reduce computational overhead, we shorten the agents' traversal path by constructing the relational graph from user-interest-biased embeddings extracted by MLLMs. A large number of weakly connected edges can then be filtered out based on embedding similarity, making the agents' traversal search efficient. Experimental results show that the proposed method achieves NMI scores of 0.9667 and 0.9481 on the Card Order and Card Suits benchmarks, respectively, improving on the previous state-of-the-art model by over 140%.
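The traversal described in the abstract can be sketched as a breadth-first expansion in which every admission decision is delegated to a judge. This is an illustrative sketch under stated assumptions, not the paper's algorithm. The `is_member` callable is a hypothetical stand-in for the MLLM agent's membership assessment, the toy graph and color labels are invented, and the global merge of redundant clusters is omitted.

```python
from collections import deque

def expand_cluster(adj, seed, is_member, unassigned):
    """Grow one cluster from `seed`, consulting `is_member` before admitting a neighbor."""
    cluster = [seed]
    unassigned.discard(seed)
    frontier = deque([seed])
    calls = 0  # number of judge (agent) queries issued
    while frontier:
        u = frontier.popleft()
        for v in sorted(adj[u]):
            if v not in unassigned:
                continue
            calls += 1
            if is_member(cluster, v):
                cluster.append(v)
                unassigned.discard(v)
                frontier.append(v)
    return cluster, calls

def traverse(adj, is_member):
    """Seed a new cluster at every node that is still unassigned."""
    unassigned = set(adj)
    clusters, total_calls = [], 0
    for seed in sorted(adj):
        if seed in unassigned:
            cluster, calls = expand_cluster(adj, seed, is_member, unassigned)
            clusters.append(sorted(cluster))
            total_calls += calls
    return clusters, total_calls

# Toy sparse graph and a rule-based judge standing in for the MLLM (both hypothetical).
colors = {0: "red", 1: "red", 2: "green", 3: "green", 4: "red"}
adj = {0: {1, 2}, 1: {0, 4}, 2: {0, 3}, 3: {2}, 4: {1}}

def same_color(cluster, candidate):
    return colors[candidate] == colors[cluster[0]]

clusters, calls = traverse(adj, same_color)
print(clusters, calls)  # [[0, 1, 4], [2, 3]] 4
```

Because the judge is only queried along surviving edges, the number of queries depends on the graph's density rather than on all node pairs. This is the intuition behind filtering weak edges before traversal. A rejected node stays unassigned and later seeds its own cluster.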

Paper Structure

This paper contains 32 sections, 9 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: The workflow of the proposed agent-centric multiple clustering framework, which obtains a personalized clustering by using MLLMs as agents to traverse a relational graph based on user preferences. The relational graph is constructed from MLLM embeddings biased toward user interests.
  • Figure 2: Overview of the Agent-Centric Personalized Multiple Clustering Framework. (a) MLLM-based graph construction, where image embeddings are extracted using an MLLM based on user interests, from which a relational graph is constructed. (b) Agent-centric graph traversal, where agents search for clusters by traversing the graph. (c) Rollout of the graph traversal process, where agents expand a cluster iteratively by assessing neighboring nodes based on user-defined criteria.
  • Figure 3: Illustration of image embedding extraction using an MLLM based on user interests.
  • Figure 4: Illustration of agent-based assessment of candidate membership according to user interests.
  • Figure 5: Graph density vs. clustering metrics and traversal steps.
  • ...and 1 more figure