INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning

Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

TL;DR

This work addresses the lack of fine-grained instance-level grounding in large multimodal models (LMMs) by introducing Inst-IT, which comprises an instance-focused benchmark (Inst-IT Bench), a large-scale Inst-IT Dataset with multi-level instance annotations for both images and videos, and a continuous instruction-tuning pipeline that uses explicit Set-of-Marks (SoM) visual prompts to guide spatial-temporal grounding. The method converts these annotations into instruction-tuning data and trains on images and videos jointly to improve both instance-level understanding and general multimodal comprehension. Empirical results show substantial gains on Inst-IT Bench and on other instance-understanding benchmarks such as RefCOCOg and ViP-Bench, as well as improvements on generic image and video benchmarks, indicating that the gains transfer beyond the instance-level setting. Overall, Inst-IT shows that explicit visual prompting combined with continuous supervised fine-tuning can enhance fine-grained instance understanding while also strengthening generic multimodal capabilities.

Abstract

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more fine-grained comprehension and alignment. Instance-level understanding is crucial for LMMs, as it focuses on the specific elements that we are most interested in. Encouragingly, existing work finds that state-of-the-art (SOTA) LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning for instance guidance. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, enhanced by Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench and other instance understanding benchmarks, but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our method not only boosts instance-level understanding but also strengthens overall generic image and video comprehension.

Paper Structure

This paper contains 28 sections, 3 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: (a) LMMs struggle with instance understanding, failing to capture the nuanced details of instances specified in user queries. (b) Our instance-centric data annotation pipeline, providing multi-level annotations for individual instances in images and videos.
  • Figure 2: Visualization of the data structure in the Inst-IT Dataset. For each video, we provide (a) $N$ frame-level annotations, (b) a video-level description, and (c) $M$ open-ended question-answer pairs. A complete data example can be found in \ref{sec:supp_dataset_exp}.
  • Figure 3: Set-of-Marks visual prompting on the original videos. Each instance is assigned a unique numeric ID, which remains consistent across all frames.
  • Figure 4: GPT-4o-based open-ended question answering correctness assessment. The underlined parts in the figure are included only when evaluating the video split, while the italicized parts will be replaced by the actual sample for scoring.
  • Figure 5: Frame-level annotation task prompt; the italicized parts are placeholders for the actual inputs.
  • ...and 5 more figures
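
The per-video data structure described in the Figure 2 caption ($N$ frame-level annotations, one video-level description, and $M$ open-ended question-answer pairs) can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions: the class names, field names, and the example record are hypothetical and are not the dataset's actual schema. The only details taken from the captions are the three annotation levels and the use of numeric instance IDs that stay consistent across frames.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FrameAnnotation:
    """One frame-level annotation (hypothetical layout)."""
    frame_index: int
    # Maps a Set-of-Marks numeric instance ID to a caption for that instance.
    instance_captions: Dict[int, str] = field(default_factory=dict)


@dataclass
class QAPair:
    """One open-ended question-answer pair."""
    question: str
    answer: str


@dataclass
class VideoSample:
    """All annotations for one video (hypothetical layout)."""
    video_id: str
    frames: List[FrameAnnotation]  # (a) N frame-level annotations
    video_description: str         # (b) one video-level description
    qa_pairs: List[QAPair]         # (c) M open-ended QA pairs

    def instance_ids(self) -> List[int]:
        """Sorted IDs of every instance marked in at least one frame."""
        ids = set()
        for frame in self.frames:
            ids.update(frame.instance_captions)
        return sorted(ids)


# Made-up record for illustration only; not taken from the dataset.
sample = VideoSample(
    video_id="demo",
    frames=[
        FrameAnnotation(0, {1: "a person holding a cup", 2: "a dog"}),
        FrameAnnotation(1, {1: "the person drinks from the cup"}),
    ],
    video_description="[1] drinks from a cup while [2] leaves the scene.",
    qa_pairs=[QAPair("What does [1] do?", "Drinks from a cup.")],
)
```

Because each instance keeps the same numeric ID in every frame, a call such as `sample.instance_ids()` can gather one instance's mentions across frames without any extra matching step.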