From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts
Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Xin Liu, Zhenbo Li
TL;DR
This paper critiques the traditional point-based prompt learning paradigm in vision-language models and proposes Points-to-Clouds (P2C), a diffusion-inspired framework that learns a semantic cloud rather than a single prompt vector. It introduces Dynamic Prompt Denoising (DPD) with structured Gaussian Mixture noise and an annealed schedule, plus an auxiliary V-L Mapper denoising loss to enforce deep cross-modal alignment. Empirically, P2C achieves state-of-the-art base-to-novel generalization on 11 datasets (HM 79.7%), and demonstrates competitive domain generalization and cross-dataset transfer, at the cost of modest training overhead and hyperparameter sensitivity. The work lays groundwork for robust, distribution-aware prompts, with future directions toward adaptive noise and more scalable denoising strategies.
Abstract
Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Vision-Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle: it overfits to base classes and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism. First, Dynamic Prompt Denoising (DPD) perturbs text prompts with structured noise under an annealed schedule to learn a smoother semantic landscape. Second, an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder, forcing it to reconstruct clean visual prompts from noisy text inputs to ensure robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at https://vranlee.github.io/P2C/.
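The dual denoising idea can be illustrated with a minimal NumPy sketch. This is not the released implementation: the cosine schedule, the mixture standard deviations and weights, the linear mapper, and all function names (annealed_scale, gmm_noise, perturb_prompt, mapper_denoising_loss) are assumptions chosen only for illustration. The sketch shows text prompt tokens being perturbed by zero-mean Gaussian-mixture noise whose scale decays over training, and a mean-squared-error loss that asks the mapper's output on the noisy prompt to match its output on the clean prompt.

```python
import numpy as np


def annealed_scale(step, total_steps, sigma_max=0.1, sigma_min=0.0):
    """Cosine-annealed noise scale (assumed schedule): sigma_max at step 0,
    sigma_min at the last step."""
    progress = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return sigma_min + 0.5 * (sigma_max - sigma_min) * (1.0 + np.cos(np.pi * progress))


def gmm_noise(shape, rng, stds=(0.5, 1.0, 2.0), weights=(0.5, 0.3, 0.2)):
    """Zero-mean Gaussian-mixture noise: each prompt token draws one mixture
    component, which sets the standard deviation of that token's noise.
    The component values here are hypothetical."""
    n_tokens, dim = shape
    component = rng.choice(len(stds), size=n_tokens, p=weights)
    std = np.asarray(stds)[component][:, None]
    return rng.standard_normal((n_tokens, dim)) * std


def perturb_prompt(text_prompt, step, total_steps, rng):
    """Add annealed mixture noise to the learnable text prompt tokens."""
    sigma = annealed_scale(step, total_steps)
    return text_prompt + sigma * gmm_noise(text_prompt.shape, rng)


def mapper_denoising_loss(mapper_weight, clean_text, noisy_text):
    """MSE between the mapper output on the noisy prompt and the mapper output
    on the clean prompt. A single linear map stands in for the V-L Mapper, and
    using the clean-input output as the target is an assumption."""
    target = clean_text @ mapper_weight
    reconstruction = noisy_text @ mapper_weight
    return float(np.mean((reconstruction - target) ** 2))


rng = np.random.default_rng(0)
text_prompt = rng.standard_normal((4, 8))      # 4 prompt tokens, text dim 8
mapper_weight = rng.standard_normal((8, 16))   # maps text dim 8 to visual dim 16

noisy_early = perturb_prompt(text_prompt, step=0, total_steps=10, rng=rng)
noisy_late = perturb_prompt(text_prompt, step=9, total_steps=10, rng=rng)
loss_early = mapper_denoising_loss(mapper_weight, text_prompt, noisy_early)
loss_late = mapper_denoising_loss(mapper_weight, text_prompt, noisy_late)
```

In this sketch the noise vanishes at the final step, so the perturbed prompt collapses back to the clean one and the auxiliary loss goes to zero. In an actual training loop this term would presumably be added to the usual contrastive classification loss with a weighting coefficient, which is omitted here.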
