Semantic Alignment for Multimodal Large Language Models

Tao Wu, Mengze Li, Jingyuan Chen, Wei Ji, Wang Lin, Jinyang Gao, Kun Kuang, Zhou Zhao, Fei Wu

TL;DR

The paper tackles semantic misalignment in multimodal large language models when handling cross-image instructions with diverse contexts. It introduces SAM, a bidirectional semantic guidance framework that aligns visual tokens across images through a perception stage (Part A) and a contextual guidance stage (Part B), aided by a novel W-former and an adaptive patch-weighting mechanism. To train and evaluate the approach, the authors curate MmLINK, a 69K-sample dataset generated via a 2-step synthesis pipeline to create contextually diverse image pairs with shared objects. Empirically, SAM achieves large CIDEr improvements over state-of-the-art methods on group captioning (+37%) and storytelling (+22%), demonstrating improved inter-image correlation understanding and coherent cross-modal reasoning. The work offers a path toward more reliable and context-aware multimodal reasoning in large language models.

Abstract

Research on Multi-modal Large Language Models (MLLMs) for multi-image cross-modal instructions has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step pipeline: first, they extract visual tokens independently for each input image; then, they align these visual tokens from different images with the Large Language Model (LLM) in its textual feature space. However, extracting visual tokens independently for each image may cause different semantics to be prioritized for different images in the first step, so that linking information among images is not preserved for subsequent LLM analysis. This issue becomes more serious when the images vary significantly (e.g., visual storytelling). To address this challenge, we introduce Semantic Alignment for Multi-modal large language models (SAM). By incorporating bidirectional semantic guidance between different images into the visual-token extraction process, SAM aims to better preserve linking information for coherent analysis and to align the semantics of different images before feeding them into the LLM. As a test bed, we propose a large-scale dataset named MmLINK consisting of 69K samples. Unlike most existing datasets for MLLM fine-tuning, MmLINK comprises multi-modal instructions with significantly diverse images. Extensive experiments on the group captioning and storytelling tasks demonstrate the effectiveness of SAM, which surpasses state-of-the-art methods by a large margin (+37% for group captioning and +22% for storytelling on CIDEr score). Project page: https://mccartney01.github.io/SAM.

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: GPT-4V shows great performance in pinpointing differences between highly similar images (a), but struggles to align knowledge concepts across images featuring notably varied contexts (b). Example in (a) is sourced from the preliminary exploration report of GPT-4V dawn.
  • Figure 2: Demonstration of our proposed 2-step sample synthesis pipeline. We begin by selecting images featuring one character in different poses (A), along with two other distinct characters (B, C). The selected images are segmented to isolate each character, and the isolated characters are then merged into mask images. Inpainting is then used to fill in the background areas of these mask images, guided by descriptions generated by ChatGPT, to obtain the final images. Text annotations are generated by InstructBLIP and further refined with ChatGPT.
  • Figure 3: Overview of SAM. The core mechanism of our SAM model is the Bidirectional Semantic Guidance mechanism with two interactive processes: Assisted Visual Token Extraction (Part A) and Contextual Semantic Generation (Part B). In Part A, the Q-former module leverages the contextual semantics $\mathbf{c}_i$, which are generated from contextual images (i.e., images other than the currently perceived image) in the multi-modal instruction in Part B, to guide the extraction of visual tokens from the currently perceived image features. In Part B, the W-former module is utilized to select the contextual semantics from the visual context of contextual images. This selection process is facilitated by the attention mechanism in the adaptive adjustment, along with assistance from the initial visual tokens $\mathbf{h}_i$, which are extracted from the currently perceived image in Part A.
  • Figure 4: Case examples generated by SAM and other MLLMs. The other MLLMs' answers either show weak instruction-following ability or contain hallucinations, while SAM successfully performs semantic alignment and produces accurate responses.
  • Figure 5: Average scores on 6 datasets for different interaction layers. $l$ is the layer that conveys the initial visual tokens and $k$ is the layer that conveys the contextual semantics, with $k \geq l$ (both are defined in the paper's Part A section). The 10 original data points are marked, and the surface is interpolated from these points.
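
The bidirectional guidance loop described in the Figure 3 caption can be illustrated with a toy sketch. This is not the authors' implementation: the single-head attention without learned projections, the element-wise sum used to inject $\mathbf{c}_i$ into the queries, and the names `cross_attend` and `bidirectional_guidance` are all assumptions made for illustration. It only shows the data flow: initial tokens $\mathbf{h}_i$ are extracted per image, they select contextual semantics $\mathbf{c}_i$ from the other images, and $\mathbf{c}_i$ then guides a second extraction pass.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Toy single-head dot-product attention (no learned projections)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
                  for kv in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def bidirectional_guidance(patches_by_image, learned_queries):
    """Hypothetical sketch of the Part A / Part B interaction."""
    if len(patches_by_image) < 2:
        raise ValueError("need at least two images to form a context")
    # Part A, first pass: initial visual tokens h_i, one set per image.
    initial = [cross_attend(learned_queries, p) for p in patches_by_image]
    guided = []
    for i, patches in enumerate(patches_by_image):
        # Part B: h_i selects contextual semantics c_i from the other images.
        context = [p for j, ps in enumerate(patches_by_image) if j != i
                   for p in ps]
        c_i = cross_attend(initial[i], context)
        # Part A, second pass: c_i guides the queries (assumed: simple sum).
        guided_queries = [[q + c for q, c in zip(q_row, c_row)]
                          for q_row, c_row in zip(learned_queries, c_i)]
        guided.append(cross_attend(guided_queries, patches))
    return guided


random.seed(0)
dim, n_queries = 4, 2
images = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
          for _ in range(3)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]
tokens = bidirectional_guidance(images, queries)
print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # 3 2 4
```

In this sketch every image ends up with the same number of visual tokens as there are queries, and those tokens depend on the other images in the instruction, which is the property the caption attributes to the guided extraction.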