Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification

Lexiang Hu; Youze Xue; Dian Li; Gang Liu; Zhouchen Lin

TL;DR

Existing multimodal embedding models mostly capture only global semantics. The paper addresses this limitation with AGFF-Embed, which jointly learns a global embedding and multiple fine-grained embeddings guided by learnable prompts. Similarities for four perception patterns (global-to-global, fine-grained-to-global, global-to-fine-grained, and fine-grained-to-fine-grained) are computed and fused with a smooth $\operatorname{logsumexp}$ aggregator into a single final similarity, so the balance between global and fine-grained matching can adapt to each task. The approach is made compatible with Explicit Gradient Amplification (EGA), which strengthens in-batch hard negatives without requiring fine-grained editing of the dataset. On the MMEB and MMVP-VLM benchmarks, the paper reports state-of-the-art performance in both general and fine-grained understanding, and its ablations examine the added perceptual patterns, the EGA technique, and the $\operatorname{logsumexp}$ aggregation.
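The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity, the temperature `tau`, the pooling of every pairwise fine-grained similarity into one $\operatorname{logsumexp}$, and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def logsumexp(values, tau=1.0):
    """Smooth maximum: tau * log(sum(exp(v / tau))), computed stably."""
    m = max(values)
    return m + tau * math.log(sum(math.exp((v - m) / tau) for v in values))


def pattern_similarities(q_global, q_fine, t_global, t_fine):
    """Similarities for the four perception patterns (assumed cosine)."""
    return {
        "global_to_global": [cosine(q_global, t_global)],
        "fine_to_global": [cosine(f, t_global) for f in q_fine],
        "global_to_fine": [cosine(q_global, f) for f in t_fine],
        "fine_to_fine": [cosine(fq, ft) for fq in q_fine for ft in t_fine],
    }


def fused_similarity(q_global, q_fine, t_global, t_fine, tau=0.1):
    """Aggregate all pattern similarities with a smooth logsumexp (assumed form)."""
    patterns = pattern_similarities(q_global, q_fine, t_global, t_fine)
    all_sims = [s for sims in patterns.values() for s in sims]
    return logsumexp(all_sims, tau)
```

Under these assumptions the fused score always lies between the largest individual similarity and that value plus `tau * log(K)`, where `K` is the number of similarities. A small `tau` therefore behaves like a hard maximum over patterns, while a larger `tau` lets several patterns contribute.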

Abstract

Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations -- CLIP-based and MLLM-based embedding models -- both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we adapt AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negatives enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.
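To make the idea of in-batch hard negative enhancement concrete, the following sketch computes the gradient of a standard InfoNCE loss with respect to each in-batch similarity and then rescales the negative terms. The InfoNCE gradient is the textbook softmax form; the amplification rule (a constant factor `alpha` on every negative) is a hypothetical stand-in, since this summary does not reproduce the paper's actual EGA formula. The function names and the temperature `tau` are likewise assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def infonce_similarity_grads(sims, pos_index, tau=0.05):
    """Gradient of -log softmax(sims / tau)[pos_index] w.r.t. each similarity."""
    probs = softmax([s / tau for s in sims])
    return [
        (p - (1.0 if i == pos_index else 0.0)) / tau
        for i, p in enumerate(probs)
    ]


def amplify_negatives(grads, pos_index, alpha=2.0):
    """Hypothetical amplification: scale the gradient of every negative by alpha.

    This is NOT the paper's EGA rule, only an illustration of explicit scaling.
    """
    return [g if i == pos_index else alpha * g for i, g in enumerate(grads)]
```

Because each negative's gradient is proportional to its softmax probability, negatives whose similarity is close to the positive's already dominate the update, and any explicit scaling of the negative terms increases their contribution the most in absolute terms.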

Paper Structure

This paper contains 19 sections, 34 equations, 2 figures, and 8 tables.

Introduction
Related Work
CLIP-based embedding models.
MLLM-based embedding models.
Method
Adaptive Global and Fine-Grained Perceptual Fusion
Explicit Hard Negative Gradient Amplification for AGFF-Embed
Experiment
Training and Evaluation on MMEB
Zero-Shot Fine-Grained Perception Evaluation
Ablation Study
Is the added perceptual pattern necessary?
Is the compatible EGA technique necessary?
Is the smooth $\operatorname{logsumexp}$ similarity aggregation necessary?
Conclusion
...and 4 more sections

Figures (2)

Figure 1: For image-text-to-image-text matching tasks, whether the query and target should focus on global or fine-grained semantic information varies case by case.
Figure 2: Framework of AGFF-Embed. (a): MLLM generates a global embedding and $N$ fine-grained embeddings. (b): Four types of similarity corresponding to four perception patterns are computed and aggregated via $\operatorname{logsumexp}$.
