
Multi-Modal Generative Embedding Model

Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

TL;DR

MM-GEM is a unified multimodal model that places both the embedding and the generative objective in a single large language model, with a PoolAggregator supplying fine-grained visual representations. It jointly optimizes cross-modal alignment and image captioning through one objective, L_MM-GEM = L_Emb + L_Gen: the embedding term is an InfoNCE loss over image and sentence embeddings, and the generation term is auto-regressive captioning conditioned on image features. A two-stage training regime and region-aware RoI pooling support fine-grained captioning and retrieval, while the advanced text model improves long-text understanding and retrieval. Empirically, MM-GEM is competitive with state-of-the-art embedding models on retrieval and classification, retains strong captioning ability, and improves fine-grained and long-text tasks, which suggests that generation and embedding can be unified with one model per modality.
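To make the combined objective concrete, here is a minimal pure-Python sketch of L_Emb + L_Gen on toy inputs. It is an illustration under stated assumptions, not the authors' implementation: the symmetric averaging of the image-to-text and text-to-image directions, the temperature value 0.07, the equal weighting of the two terms, and all function names are assumptions that this summary does not specify.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """InfoNCE over a batch where image k and text k form the positive pair.

    Assumption: both directions are averaged and temperature is 0.07.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    n = len(sim)
    # Image-to-text: each row is a classification over texts.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image: each column is a classification over images.
    t2i = -sum(log_softmax([sim[r][k] for r in range(n)])[k]
               for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def caption_nll(token_logits, target_ids):
    """Mean next-token negative log-likelihood for one caption."""
    losses = [-log_softmax(logits)[t]
              for logits, t in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)


def mm_gem_loss(image_embs, text_embs, token_logits, target_ids):
    """Unweighted sum of the embedding and generation terms."""
    return info_nce(image_embs, text_embs) + caption_nll(token_logits, target_ids)
```

In a real model, both the sentence embeddings and the caption logits would come from the same language model, which is the point of the unified design; here they are passed in as plain lists so that the loss arithmetic can be checked in isolation.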

Abstract

Most multi-modal tasks can be formulated as problems of either generation or embedding. Existing models usually tackle these two types of problems by decoupling the language modules into a text decoder for generation and a text encoder for embedding. To explore the minimalism of multi-modal paradigms, we attempt to use only one model per modality in this work. We propose a Multi-Modal Generative Embedding Model (MM-GEM), in which the generative and embedding objectives are encapsulated in one Large Language Model. We also propose a PoolAggregator to boost efficiency and to enable fine-grained embedding and generation. A surprising finding is that these two objectives do not significantly conflict with each other. For example, MM-GEM instantiated from ViT-Large and TinyLlama shows competitive performance on benchmarks for multimodal embedding models, such as cross-modal retrieval and zero-shot classification, while also retaining good image captioning ability. Additionally, MM-GEM can seamlessly execute region-level image caption generation and retrieval tasks. Moreover, the advanced text model in MM-GEM brings over 5% improvement in Recall@1 for long-text and image retrieval.
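The region-level capability mentioned above relies on pooling visual features inside a region of the feature map. The sketch below shows the general idea with plain mean pooling over a box of grid cells. It is a stand-in under assumptions: this summary does not describe the internal design of the PoolAggregator, and the function name `roi_mean_pool`, the half-open box convention, and the nested-list layout (height x width x channels) are hypothetical.

```python
def roi_mean_pool(feature_map, box):
    """Average the feature vectors inside a box of grid cells.

    feature_map: nested lists indexed as [row][col][channel].
    box: (top, left, bottom, right) in grid cells, with bottom and right
         exclusive (an assumed convention for this sketch).
    """
    top, left, bottom, right = box
    cells = [feature_map[r][c]
             for r in range(top, bottom)
             for c in range(left, right)]
    if not cells:
        raise ValueError("box covers no cells")
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


# A 2x2 feature map with 2 channels per cell.
fmap = [[[1.0, 0.0], [3.0, 0.0]],
        [[5.0, 2.0], [7.0, 2.0]]]

global_feature = roi_mean_pool(fmap, (0, 0, 2, 2))  # whole image
region_feature = roi_mean_pool(fmap, (1, 0, 2, 2))  # bottom row only
```

With this view, a global image embedding is the special case where the box covers the whole grid, and a region embedding or region caption uses the same pooled vector computed over a smaller box.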

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: Overview of MM-GEM, in which a large language model acts as both text encoder for embedding and text decoder for generation. The visual feature is aligned with the LLM by several projection layers and a PoolAggregator.
  • Figure 2: Visualization of fine-grained description generation. This figure shows the captioning results of using region features from the visual feature map as input. The text in the same color as the bounding box in the figure is the description of the corresponding area. Text on a gray background indicates results without region description data.
  • Figure 3: Visualization of fine-grained image-text retrieval. This figure shows the similarity between the visual feature map and the text feature at the two training stages. Blue borders and undertones show the results from pre-training stage one, and yellow borders and undertones show the results from stage two. The text superimposed on each image is the input text.
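
The similarity visualization described for Figure 3 can be approximated by comparing every cell of the visual feature map with a single text embedding. The following is a minimal sketch, assuming cosine similarity as the score and a nested-list feature map indexed as [row][col][channel]; the function names are hypothetical and the paper's exact scoring may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_map(feature_map, text_emb):
    """Score each grid cell of the feature map against the text embedding."""
    return [[cosine(cell, text_emb) for cell in row] for row in feature_map]


def argmax_cell(sim):
    """Return (row, col) of the highest score; ties go to the later cell."""
    best = max((val, r, c)
               for r, row in enumerate(sim)
               for c, val in enumerate(row))
    return best[1], best[2]


# A 2x2 feature map with 2 channels and a text embedding pointing along channel 1.
fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [-1.0, 0.0]]]
sim = similarity_map(fmap, [0.0, 2.0])
best_cell = argmax_cell(sim)
```

Rendering `sim` as a heat map over the image grid gives a picture of which locations the text attends to most strongly, which is the kind of comparison the figure makes between the two training stages.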