Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification
Lexiang Hu, Youze Xue, Dian Li, Gang Liu, Zhouchen Lin
TL;DR
Existing multimodal embeddings mostly capture global semantics. The paper addresses this limitation with AGFF-Embed, which jointly learns one global embedding and multiple fine-grained embeddings guided by learnable prompts. Four perception-pattern similarities (global-to-global, fine-grained-to-global, global-to-fine-grained, and fine-grained-to-fine-grained) are computed and fused with a smooth log-sum-exp aggregator into a single final similarity, so the fusion can adapt across tasks. The approach is made compatible with Explicit Gradient Amplification (EGA) to strengthen in-batch hard negatives without requiring fine-grained editing of the dataset. Empirical results on MMEB and MMVP-VLM show state-of-the-art performance in both general and fine-grained understanding, and ablations confirm the contribution of each component and the effectiveness of the fusion strategy.
Abstract
Multimodal embeddings serve as a bridge for aligning vision and language. The two primary implementations (CLIP-based and MLLM-based embedding models) are both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that the complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, which calls for a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we make AGFF-Embed compatible with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negative enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed achieves state-of-the-art performance in both general and fine-grained understanding compared with other multimodal embedding models.
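
To make the fusion step concrete, the following is a minimal sketch of how one global embedding and several fine-grained embeddings per side could be combined into a single score with a log-sum-exp aggregator. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fused_similarity`, `logsumexp`), the use of cosine similarity, the temperature `tau`, and the choice to pool all four similarity patterns in one log-sum-exp are hypothetical; the paper may weight or combine the patterns differently.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def logsumexp(values, tau=1.0):
    # Numerically stable tau * log(sum(exp(v / tau))): a smooth upper bound on max(v).
    v = np.asarray(values, dtype=np.float64) / tau
    m = v.max()
    return tau * (m + np.log(np.exp(v - m).sum()))

def fused_similarity(q_global, q_fine, c_global, c_fine, tau=0.1):
    """Fuse four similarity patterns between a query and a candidate.

    q_global, c_global: (d,) global embeddings (assumed shapes).
    q_fine, c_fine:     (K, d) fine-grained embeddings (assumed shapes).
    Returns the fused score and the flat vector of all pairwise similarities.
    """
    q_global, c_global = l2_normalize(q_global), l2_normalize(c_global)
    q_fine, c_fine = l2_normalize(q_fine), l2_normalize(c_fine)

    g2g = np.array([q_global @ c_global])   # global-to-global, 1 value
    f2g = q_fine @ c_global                 # fine-grained-to-global, K values
    g2f = c_fine @ q_global                 # global-to-fine-grained, K values
    f2f = (q_fine @ c_fine.T).ravel()       # fine-grained-to-fine-grained, K*K values

    sims = np.concatenate([g2g, f2g, g2f, f2f])
    return logsumexp(sims, tau), sims

# Usage with random stand-in embeddings (not real model outputs).
rng = np.random.default_rng(0)
d, K = 8, 3
score, sims = fused_similarity(
    rng.normal(size=d), rng.normal(size=(K, d)),
    rng.normal(size=d), rng.normal(size=(K, d)),
)
```

In this sketch, a small `tau` makes the fused score approach the maximum of the individual similarities (the best-matching pattern dominates), while a larger `tau` spreads the contribution more evenly. Because log-sum-exp is differentiable everywhere, gradients flow to every pattern rather than only to the winning one.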
