
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan

TL;DR

OMG-LLaVA addresses the gap between image-level understanding and fine-grained pixel-level reasoning by unifying image-, object-, and pixel-level tasks in a single framework. It uses a frozen universal perception module (OMG-Seg) and a single LLM, enhanced by a perception prior embedding, to enable end-to-end token-to-token generation for captioning, referring segmentation, and grounded conversations. The approach achieves competitive or superior performance across multiple datasets with a simpler, cost-efficient architecture and supports visual prompts for interactive segmentation. This work advances MLLM design by reducing components while expanding capabilities across multiple granularities of visual reasoning.

Abstract

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect separate specialist models, our work aims at end-to-end training of one encoder, one decoder, and one LLM. The code and model have been released for further research.

Paper Structure (16 sections, 5 equations, 14 figures, 10 tables)

Figures (14)

  • Figure 1: The comprehensive capabilities of OMG-LLaVA. OMG-LLaVA can handle a variety of pixel-level, object-level, and image-level understanding and reasoning tasks.
  • Figure 2: Summary of Current MLLM Architectures: (a) MLLMs with only image-level capability, including [liu2023llava, liu2023llavaplus, liu2024llavanext, li2024mini], etc., (b) MLLMs with object-level capability, including [yuan2023osprey, hanoona2023GLaMM], (c) MLLMs with pixel-level capability, including [lai2023lisa, ren2023pixellm], etc., (d) MLLMs with both object-level and pixel-level capabilities but with a very complex system, such as [hanoona2023GLaMM], (e) OMG-LLaVA's architecture, which possesses an elegant and simple design while having image-level, object-level, and pixel-level capabilities.
  • Figure 3: Overview of OMG-LLaVA. OMG-LLaVA consists of OMG-Seg and an LLM. OMG-Seg tokenizes the image into pixel-centric visual tokens, and tokenizes the detected objects and input visual prompts into object-centric visual tokens. Additionally, the [SEG] token output by the LLM is decoded by OMG-Seg into segmentation masks. OMG-Seg remains frozen at all stages.
  • Figure 4: The Architecture of the OMG Decoder. A simple attention mask generation strategy enables the OMG decoder to encode point, box, and mask prompts.
  • Figure 5: The process of the perception prior embedding strategy. The perception prior embedding strategy integrates object queries into image features based on segmentation prior.
  • ...and 9 more figures
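
The Figure 5 caption says that perception prior embedding integrates object queries into image features based on a segmentation prior, but this page does not give the formula. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes that each pixel takes a softmax over per-object mask logits, mixes the object queries with those weights, and adds the result to the pixel's image feature. The function name, argument layout, and the additive fusion are all hypothetical.

```python
import math

def perception_prior_embedding(image_feats, masks, queries):
    """Hypothetical sketch: add a mask-weighted mix of object queries to each pixel.

    image_feats: list of P pixel feature vectors, each of length C
    masks:       list of N rows of mask logits, each row of length P (one row per object)
    queries:     list of N object query vectors, each of length C
    """
    tokens = []
    for p, feat in enumerate(image_feats):
        # Assumed step: softmax over the N objects at this pixel (max-shifted for stability).
        logits = [mask[p] for mask in masks]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the object queries gives a per-pixel prior vector.
        prior = [sum(w * q[c] for w, q in zip(weights, queries)) for c in range(len(feat))]
        # Assumed fusion: element-wise addition of the prior to the image feature.
        tokens.append([f + s for f, s in zip(feat, prior)])
    return tokens

# Toy input: two pixels, two objects; each pixel clearly belongs to one object.
feats = [[0.0, 0.0], [0.0, 0.0]]
masks = [[10.0, -10.0], [-10.0, 10.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = perception_prior_embedding(feats, masks, queries)
```

In this toy case each pixel's token ends up close to the query of the object whose mask covers it, which is the intuition behind injecting the segmentation prior into pixel-centric visual tokens.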