Multi-modal Dynamic Proxy Learning for Personalized Multiple Clustering

Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Ziyue Peng, Zewei Liu, Hewei Wang, Jiayi Zhang, Edith C. H. Ngai

TL;DR

Multi-DProxy tackles the problem of user-irrelevant clustering in multi-view data by learning dynamic textual proxies that steer multimodal fusion. It introduces gated cross-modal fusion, dynamic candidate management, and dual-constraint proxy optimization to generate personalized clusterings aligned with user concepts. The framework leverages frozen CLIP encoders and GPT-4 generated concept candidates, with adaptive fusion and contrastive learning to ensure both semantic coherence and discriminability. Empirical results across diverse benchmarks demonstrate state-of-the-art performance, complemented by theoretical analyses of proxy stability and cross-modal discriminability, highlighting practical impact for personalized multi-clustering tasks.

Abstract

Multiple clustering aims to discover diverse latent structures from different perspectives, yet existing methods generate exhaustive clusterings without discerning user interest, necessitating laborious manual screening. Current multi-modal solutions suffer from static semantic rigidity: predefined candidate words fail to adapt to dataset-specific concepts, and fixed fusion strategies ignore evolving feature interactions. To overcome these limitations, we propose Multi-DProxy, a novel multi-modal dynamic proxy learning framework that leverages cross-modal alignment through learnable textual proxies. Multi-DProxy introduces 1) gated cross-modal fusion, which synthesizes discriminative joint representations by adaptively modeling feature interactions; 2) dual-constraint proxy optimization, in which user interest constraints enforce semantic consistency with domain concepts while concept constraints employ hard example mining to enhance cluster discrimination; and 3) dynamic candidate management, which refines textual proxies through iterative clustering feedback. As a result, Multi-DProxy not only captures a user's interest effectively through proxies but also identifies the relevant clusterings with greater precision. Extensive experiments demonstrate state-of-the-art performance, with significant improvements over existing methods across a broad set of multi-clustering benchmarks.
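
To make the idea of gated cross-modal fusion concrete, the following is a minimal sketch only. The abstract does not give the gate's form, so the element-wise sigmoid gate over the concatenated features, and the names `gated_fusion`, `weights`, and `bias`, are assumptions for illustration rather than the authors' formulation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, textual, weights, bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Assumed form (not taken from the paper):
        gate_j  = sigmoid(weights[j] . concat(visual, textual) + bias[j])
        fused_j = gate_j * visual_j + (1 - gate_j) * textual_j
    """
    joint = visual + textual  # list concatenation of the two modalities
    fused = []
    for j in range(len(visual)):
        gate = sigmoid(sum(w * x for w, x in zip(weights[j], joint)) + bias[j])
        fused.append(gate * visual[j] + (1.0 - gate) * textual[j])
    return fused


# Toy inputs standing in for frozen-encoder image and text features.
random.seed(0)
dim = 4
visual = [random.uniform(-1, 1) for _ in range(dim)]
textual = [random.uniform(-1, 1) for _ in range(dim)]
weights = [[random.gauss(0, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim

fused = gated_fusion(visual, textual, weights, bias)
```

Because the gate lies in (0, 1), each fused coordinate is a convex combination of the visual and textual coordinates; in a trained model the gate parameters would be learned, so the mix adapts per feature instead of being fixed in advance.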

Paper Structure

This paper contains 31 sections, 2 theorems, 27 equations, 5 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

(Proxy Stability) The dynamic candidate update reduces semantic drift by bounding proxy divergence: where $\gamma=\max _i \sum_k \alpha_{i k}$ is the maximum attention mass (bounded by 1), and $\mathbf{c}_{k}^{(t)}$ denotes candidate $k$ at iteration $t$. The bound ensures proxy stability during candidate updates.
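
The quantity $\gamma=\max_i \sum_k \alpha_{ik}$ can be computed directly from an attention matrix. The sketch below assumes, purely for illustration, that each row of $\alpha$ comes from a softmax over candidate scores; the scores and the helper names are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_attention_mass(alpha):
    """gamma = max_i sum_k alpha[i][k]."""
    return max(sum(row) for row in alpha)


# Hypothetical proxy-to-candidate scores: 2 proxies, 3 candidates.
scores = [[2.0, 0.5, -1.0], [0.1, 0.1, 0.1]]
alpha = [softmax(row) for row in scores]
gamma = max_attention_mass(alpha)
```

Under the softmax assumption every row sums to 1, so $\gamma$ sits at its upper bound; rows whose attention is spread over only part of the candidate set would give a smaller value.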

Figures (5)

  • Figure 1: Overview of the Multi-DProxy framework. The central pipeline illustrates the overall architecture, while the key components are detailed on both sides: (1) Dynamic Candidate Management updates candidate words every $R$ epochs; (2) Gated Cross-modal Fusion integrates visual and textual representations; (3) Cross-modal Alignment reduces modality discrepancies; (4) Concept Discrimination Constraints enhance cluster separability; and (5) User Interest Constraints ensure alignment with domain-specific concepts.
  • Figure 2: Ablation study. For each dataset, the average performance across all clustering objects is reported.
  • Figure 3: Visualization of textual, visual, and fused representations on the Fruit dataset.
  • Figure 4: Hyperparameter analysis on the Fruit dataset.
  • Figure 5: Efficiency study on the Fruit and Card datasets.
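
The Figure 1 caption states that candidate words are updated every $R$ epochs. The sketch below illustrates only that schedule. The scoring rule (best cosine similarity to a cluster centroid), the toy embeddings, and the names `refresh_candidates` and `keep` are hypothetical stand-ins, not the paper's update rule.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def refresh_candidates(candidates, centroids, keep):
    """Keep the `keep` candidates with the highest best-centroid similarity.

    This ranking rule is an assumption made for illustration.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: max(cosine(item[1], c) for c in centroids),
        reverse=True,
    )
    return dict(ranked[:keep])


R = 2  # hypothetical refresh interval, in epochs
candidates = {"red": [1.0, 0.0], "round": [0.0, 1.0], "metal": [-1.0, -1.0]}
centroids = [[0.9, 0.1], [0.1, 0.9]]  # toy cluster centroids

refresh_epochs = []
for epoch in range(1, 7):
    # ... one training epoch would run here ...
    if epoch % R == 0:
        candidates = refresh_candidates(candidates, centroids, keep=2)
        refresh_epochs.append(epoch)
```

In this toy run the candidate set is refreshed at epochs 2, 4, and 6, and the candidate that matches neither centroid is dropped at the first refresh.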

Theorems & Definitions (3)

  • Remark 1
  • Proposition 1
  • Theorem 1