Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, Xuansong Xie

TL;DR

This paper proposes AnyRef, a general MLLM that generates pixel-wise object perceptions and natural language descriptions from multi-modality references (text, boxes, images, or audio) and achieves state-of-the-art results across multiple benchmarks.

Abstract

Multimodal Large Language Models (MLLMs) leverage Large Language Models as a cognitive framework for diverse visual-language tasks. Recent efforts have been made to equip MLLMs with visual perceiving and grounding capabilities. However, a gap remains in providing fine-grained pixel-level perceptions and extending interactions beyond text-specific inputs. In this work, we propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references, such as text, boxes, images, or audio. This gives users greater flexibility to engage with the model beyond textual and regional prompts, without modality-specific designs. Through our proposed refocusing mechanism, the generated grounding output is guided to better focus on the referenced object, implicitly incorporating additional pixel-level supervision. This simple modification uses attention scores generated during LLM inference, eliminating the need for extra computation while improving both grounding masks and referring expressions. With only publicly available training data, our model achieves state-of-the-art results across multiple benchmarks, including diverse-modality referring segmentation and region-level referring expression generation.

Paper Structure

This paper contains 20 sections, 3 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Multi-modality Referring Segmentation and Expression Generation with AnyRef. Our model possesses the capacity to generate natural language descriptions as well as pixel-wise grounding masks for the referred object. It accommodates various referring modalities such as text, bounding boxes, images and audio, enabling more flexible user interactions.
  • Figure 2: Overall pipeline of AnyRef. Vision-language, audio-language projection and MLP layers are omitted for simplicity and clarity. The Unified Referring Representation receives references from diverse modalities and transforms them into embeddings aligned with the LLM. The Refocusing Mechanism enhances the embedding from the single <obj> token with grounded textual embeddings, thus providing a broader representational capacity.
  • Figure 3: Qualitative results of AnyRef's applicable capabilities on multiple tasks, including (a) referring expression segmentation, (b) region-level captioning and grounding, (c) image-level referring segmentation and (d) audio-visual segmentation. AnyRef demonstrates proficiency in generating both textual responses and pixel-level perceptions across diverse modality instructions.
  • Figure 4: Comparison of ground-truth referring expressions with those generated by LLM-based methods.
  • Figure 5: Visualization of mask embeddings before and after the refocusing mechanism. "original" denotes the original mask embeddings, while "vehicle", "person", and "animal" denote the updated mask embeddings for the corresponding referred objects in the textual expression.
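
The refocusing mechanism mentioned in the abstract and in the Figure 2 caption can be illustrated with a small sketch. The paper's exact formulation is not given on this page, so the following is only one plausible reading: the embedding of the single <obj> token is enriched with a sum of text-token embeddings, weighted by attention scores that the LLM has already produced. The function name `refocus`, its arguments, and the normalization step are all assumptions made for illustration.

```python
from typing import List, Sequence


def refocus(
    obj_embedding: Sequence[float],
    token_embeddings: Sequence[Sequence[float]],
    attention_scores: Sequence[float],
) -> List[float]:
    """Hypothetical sketch: add an attention-weighted sum of token
    embeddings to the <obj> token embedding.

    This is an illustrative assumption, not the paper's actual equation.
    """
    if len(token_embeddings) != len(attention_scores):
        raise ValueError("need one attention score per token embedding")

    total = sum(attention_scores)
    if total <= 0:
        # No usable attention signal: leave the <obj> embedding unchanged.
        return list(obj_embedding)

    # Normalize the scores so the weights sum to 1 (an assumed design choice).
    weights = [score / total for score in attention_scores]

    dim = len(obj_embedding)
    # Attention-weighted sum of the token embeddings, one dimension at a time.
    context = [
        sum(w * tok[i] for w, tok in zip(weights, token_embeddings))
        for i in range(dim)
    ]
    # Residual-style update: original <obj> embedding plus the weighted context.
    return [o + c for o, c in zip(obj_embedding, context)]


if __name__ == "__main__":
    obj = [1.0, 1.0]
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    scores = [3.0, 1.0]  # the first token receives three times the attention
    print(refocus(obj, tokens, scores))
```

Because the weights reuse scores the LLM computes anyway, a scheme of this kind needs no additional forward pass, which is consistent with the abstract's claim that the mechanism requires no extra computation.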