Multi-modal Generation via Cross-Modal In-Context Learning

Amandeep Kumar; Muzammal Naseer; Sanath Narayan; Rao Muhammad Anwer; Salman Khan; Hisham Cholakkal

TL;DR

The paper tackles generating images from long, complex multimodal prompts while preserving context and multi-object fidelity. It introduces MGCC, a framework that fuses frozen large language models with diffusion through a Cross-Modal Refinement Module to learn cross-modal dependencies in the LLM embedding space and a Contextual Object Grounding Module to predict object layouts. MGCC aligns image tokens within the LLM space, uses in-context bounding box generation to condition diffusion, and demonstrates improved CLIP similarities on Visual Story Generation and Visual Dialogue Context compared with SOTA. The work shows that cross-modal refinement and grounding enable context-aware multimodal generation and dialogue capabilities with training efficiency.

Abstract

In this work, we study the problem of generating novel images from complex multimodal prompt sequences. While existing methods achieve promising results for text-to-image generation, they often struggle to capture fine-grained details from lengthy prompts and to maintain contextual coherence within prompt sequences. Moreover, they often produce misaligned images for prompt sequences featuring multiple objects. To address this, we propose a Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that generates novel images from complex multimodal prompt sequences by leveraging the combined capabilities of large language models (LLMs) and diffusion models. Our MGCC comprises a novel Cross-Modal Refinement module that explicitly learns cross-modal dependencies between text and image in the LLM embedding space, and a contextual object grounding module that generates object bounding boxes, specifically targeting scenes with multiple objects. Our MGCC demonstrates a diverse range of multimodal capabilities, such as novel image generation, multimodal dialogue, and text generation. Experimental evaluations on two benchmark datasets demonstrate the effectiveness of our method. On the Visual Story Generation (VIST) dataset with multimodal inputs, our MGCC achieves a CLIP Similarity score of $0.652$, compared to $0.641$ for the SOTA GILL. Similarly, on Visual Dialogue Context (VisDial), which has lengthy dialogue sequences, our MGCC achieves a CLIP score of $0.660$, outperforming the existing SOTA method, which scores $0.645$. Code: https://github.com/VIROBO-15/MGCC
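As a rough illustration of how a CLIP Similarity score like the ones quoted above could be computed, the sketch below averages the cosine similarity between pairs of embedding vectors. It assumes the metric is a cosine similarity between the CLIP embedding of a generated image and that of its reference, and that the embeddings have already been extracted. The function names and the toy 3-dimensional vectors are hypothetical and are not taken from the MGCC code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def mean_clip_similarity(generated, references):
    """Average the per-pair similarity over a dataset of embedding pairs."""
    if not generated or len(generated) != len(references):
        raise ValueError("need equally sized, non-empty embedding lists")
    scores = [cosine_similarity(g, r) for g, r in zip(generated, references)]
    return sum(scores) / len(scores)


# Toy vectors stand in for real CLIP embeddings (assumption for illustration).
generated = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
references = [[1.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
score = mean_clip_similarity(generated, references)
```

In this toy case the first pair is identical and the second is orthogonal, so the dataset-level score is the mean of the two per-pair similarities. A real evaluation would replace the toy vectors with embeddings produced by a CLIP encoder.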

Paper Structure (10 sections, 4 equations, 6 figures, 4 tables)

Introduction
Related Works
Method
Overall Framework
Cross Modal Refinement Module
Contextual Object Grounding Module
Experiment
Implementation Details
Experimental Results
Conclusion

Figures (6)

Figure 1: Example images depicting the impact of progressively integrating our cross-modal refinement module (CMRM) and contextual object grounding module (COGM) into the baseline. In the first row, the baseline generates an image of "cookies and coffee in a plate", which does not align with the earlier prompt "the boss is teaching the new employee to prepare coffee and snack." Although integrating our CMRM into the baseline improves semantic understanding, the generated image still fails to include the person instance in the scene. Finally, by incorporating our COGM, we achieve better alignment with the ground truth, resulting in an image that matches the number of "persons" mentioned in the earlier prompt. Similarly, in the second row, the baseline struggles to generate an image consistent with the text "the glowing embers of a campfire is so relaxing". Our refinement module comprehends the prompts and generates "people and campfire", although the last prompt is most aligned with the "campfire". Our grounding module generates bounding boxes for the "campfire", resulting in an image more aligned with the specified context.
Figure 2: Overall framework of our model, MGCC, which generates novel images from multimodal prompts. During training, (a) our model first aligns the image into the LLM token embedding space. (b) To generate novel images, we introduce special image tokens $[I]$ to the LLM vocabulary. We refine these image tokens $[I]$ in the LLM feature space with a novel cross-modal refinement module (CMRM), and then align the refined features with the CLIP text encoder space. The refined image tokens $\hat{F}_{I}$ are taken as input to the Transformer Mapper $S_w$, which maps them into the CLIP text embedding space as $f_g(y)$. (c) During inference, we use Contextual Object Grounding to generate the bounding boxes $\{b_i\}_{i=1}^{p}$ for the objects present in the scene. We condition the diffusion model $D$ on these bounding boxes $\{b_i\}_{i=1}^{p}$, along with the refined image token embedding $f_g$, to generate the final image $I_g$.
Figure 3: The images on the left showcase examples illustrating the multimodal generation capabilities of our MGCC, which operates on sequential multimodal input dialogues arranged from top to bottom. On the right-hand side, the images demonstrate: (a) the model's ability to perform grounded generation, (b) its proficiency in following instructions, and (c) its capability in generating descriptive captions for images.
Figure 3: Image generation performance on CC3M (Sharma et al., 2018) and VIST (Huang et al., 2016) as our proposed contributions are added to the baseline.
Figure 4: In the first row, the baseline model produces a holistic representation of the scene, including the "statue" and the "flowers", whereas our model excels at generating the "hue" flower picture. In the second row, the baseline model fails to comprehend the context sequence about the "room" and the "guest", whereas our model captures this context and, with the help of the context of "relaxing", generates the "room having the bed". In the third row, the baseline and the diffusion model lose context as the prompt sequence grows and generate "trees" and "old lady", whereas our model generates images closely aligned with the ground-truth text "barrels in the aging room".
...and 1 more figure
