Towards More Unified In-context Visual Understanding

Dianmo Sheng; Dongdong Chen; Zhentao Tan; Qiankun Liu; Qi Chu; Jianmin Bao; Tao Gong; Bin Liu; Shengwei Xu; Nenghai Yu

TL;DR

Existing visual in-context learning systems are each confined to a single output modality. This work addresses that limitation with a unified multimodal in-context learning framework that supports multimodal outputs through modality-specific tokenization and a shared interleaved token space. It combines vision-language prompts (3.1), unified multimodal representations (3.2), and a GPT-2-style decoder with sparse Mixture-of-Experts (3.3) to enable in-context learning on tasks such as class-aware segmentation and dense captioning. Evaluations on MS-COCO and Visual Genome show results competitive with specialized models and state-of-the-art vision-language baselines, and ablations show the benefits of text-based bbox prompts and multi-task co-training. The approach handles multimodal in-context learning in a single pipeline and lays groundwork for extending to additional modalities and tasks.
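To make the "sparse Mixture-of-Experts" idea concrete, here is a minimal sketch of top-k expert routing for a single token. It is an illustration only: this summary does not say how the paper routes tokens, so the gating rule (softmax over gate logits, keep the k largest, renormalize), the names `top_k_route` and `moe_forward`, and the toy experts are all assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k experts with the highest gate probability.

    Returns (expert_index, weight) pairs whose weights are renormalized
    to sum to 1. This routing rule is an assumption, not the paper's.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(
    token: Vector,
    gate_logits: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
    k: int = 2,
) -> Vector:
    """Weighted sum of the outputs of the k selected experts only.

    Experts that are not selected are never evaluated, which is what
    makes the layer sparse.
    """
    out = [0.0] * len(token)
    for idx, weight in top_k_route(gate_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy experts (hypothetical): each just scales its input.
toy_experts = [
    lambda x: [1.0 * v for v in x],
    lambda x: [2.0 * v for v in x],
    lambda x: [3.0 * v for v in x],
]
```

With `k=1` only the highest-scoring expert runs, so `moe_forward([1.0], [0.0, 5.0, 1.0], toy_experts, k=1)` uses the second toy expert alone; larger `k` blends several experts by their renormalized gate weights.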

Abstract

The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in natural language processing. Recently, ICL has been applied to visual understanding tasks, such as semantic segmentation and image captioning, with promising results. However, existing visual ICL frameworks cannot produce content across multiple modalities, which limits their potential usage scenarios. To address this issue, we present a new ICL framework for visual understanding that supports multimodal output. First, we quantize and embed both text and visual prompts into a unified representational space, structured as interleaved in-context sequences. Then a decoder-only sparse transformer architecture performs generative modeling on these sequences, facilitating in-context learning. Thanks to this design, the model can handle in-context visual understanding tasks with multimodal output in a unified pipeline. Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall, our research takes a further step toward unified multimodal in-context learning.
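The "unified representational space, structured as interleaved in-context sequences" can be pictured with a small sketch. It assumes text and image tokens arrive as modality-local integer ids (as a GPT-2 tokenizer and a VQGAN tokenizer would produce) and maps them into one shared id range, wrapped in special tokens. The image codebook size, the special-token names, and the helper names `to_unified` and `build_sequence` are hypothetical; only the GPT-2 vocabulary size of 50257 is a known value.

```python
from typing import Dict, List, Tuple

TEXT_VOCAB_SIZE = 50257      # GPT-2 BPE vocabulary size
IMAGE_CODEBOOK_SIZE = 1024   # assumed VQGAN codebook size (not from the source)

# Hypothetical special tokens marking the start and end of each segment.
SPECIAL_TOKENS = ["[BOI]", "[EOI]", "[BOT]", "[EOT]"]

# Layout of the shared id space: text ids, then image ids, then specials.
IMAGE_OFFSET = TEXT_VOCAB_SIZE
SPECIAL_OFFSET = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
SPECIAL_IDS: Dict[str, int] = {
    name: SPECIAL_OFFSET + i for i, name in enumerate(SPECIAL_TOKENS)
}

Segment = Tuple[str, List[int]]  # ("text" or "image", modality-local ids)


def to_unified(segment: Segment) -> List[int]:
    """Shift one segment's local ids into the shared space and wrap it."""
    kind, ids = segment
    if kind == "text":
        if any(not 0 <= i < TEXT_VOCAB_SIZE for i in ids):
            raise ValueError("text id out of range")
        return [SPECIAL_IDS["[BOT]"], *ids, SPECIAL_IDS["[EOT]"]]
    if kind == "image":
        if any(not 0 <= i < IMAGE_CODEBOOK_SIZE for i in ids):
            raise ValueError("image id out of range")
        shifted = [IMAGE_OFFSET + i for i in ids]
        return [SPECIAL_IDS["[BOI]"], *shifted, SPECIAL_IDS["[EOI]"]]
    raise ValueError(f"unknown modality: {kind}")


def build_sequence(segments: List[Segment]) -> List[int]:
    """Concatenate in-context samples and the query into one token sequence."""
    sequence: List[int] = []
    for segment in segments:
        sequence.extend(to_unified(segment))
    return sequence
```

A single embedding table of size `SPECIAL_OFFSET + len(SPECIAL_TOKENS)` could then embed every position of the sequence, whichever modality it came from, and a decoder-only model could be trained on it with next-token prediction.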

Paper Structure (16 sections, 3 equations, 12 figures, 7 tables)

Introduction
Related Works
Method
Vision-Language Prompt Design
Unified Multimodal Representations
Model Architecture and Training Objective
Experiments
Datasets and Benchmarks
Implementation Details
Ablation Studies
Comparison with State-of-the-art Methods
Conclusion
Model Architecture and Configuration
Additional Quantitative Analysis
Additional Qualitative Results
...and 1 more sections

Figures (12)

Figure 1: Motivation illustration of our method. In earlier efforts, existing in-context visual understanding models were confined to a particular output modality. For instance, SegGPT specialized in "Image$\rightarrow$Image" applications, tailored for tasks involving image segmentation. Similarly, Flamingo was purpose-built for "Image$\rightarrow$Text" scenarios, focusing on language-centric tasks such as image captioning. In contrast, we take a further attempt to design a unified model capable of handling multimodal in-context visual understanding tasks for "Image$\rightarrow$Image / Text" scenarios.
Figure 2: Overview of our unified multimodal representations pipeline with two stages. During the multimodal quantization phase, visual and linguistic inputs are encoded into discrete tokens via modality-specialized tokenizers: specifically, VQGAN's tokenizer for visual data and GPT-2's tokenizer for texts. After that, in the unified embedding stage, multimodal discrete tokens are formatted as an interleaved sequence with special tokens. Then a unified embedding layer projects the sequence into general representations.
Figure 3: Overview of our pipeline. Here, we take the CA-ICL captioning task as an example. Multiple in-context samples and the input pair are first tokenized using modality-specific tokenizers and then projected into unified embedding representations. After undergoing interleaved concatenation, the tokens are inputted into the model for generative modeling.
Figure 4: Class-aware in-context understanding task definitions. For the sake of easy demonstration, only one in-context sample is used here. The blue boxes $\square$ on the left display the inputs of the model, while the red boxes $\square$ on the right show the corresponding output. (In the absence of additional clarification, subsequent notations convey the same meaning.)
Figure 5: Analysis of the impact of including bbox information. For better visualization, the ground truth bboxes are indicated by rose boxes $\square$, while the predicted bboxes are highlighted in green boxes $\square$. With the bbox information in prompts, the model yields more precise descriptions that are aligned with the specified region locations.
...and 7 more figures
