Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng

TL;DR

This work targets pixel-level understanding with a radically simplified architecture by proposing Pixel-SAIL, an encoder-free, single-transformer MLLM. It introduces three plug-in improvements (learnable upsampling, visual prompt injection, and dense feature distillation) and a new benchmark, PerBench, which assesses detailed object description, visual-prompt-based question answering, and visual-text referring segmentation. Empirically, Pixel-SAIL achieves competitive or superior performance on multiple pixel-grounded benchmarks with smaller models and simpler pipelines, underscoring the viability of encoder-free designs for fine-grained tasks. The authors also provide a data engine and training protocol to enable broader exploration of pixel-grounded MLLMs.

Abstract

Multimodal Large Language Models (MLLMs) achieve remarkable performance on fine-grained pixel-level understanding tasks. However, existing works rely heavily on extra components, such as a vision encoder (e.g., CLIP) and segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by recent works on the Single trAnsformer as a unified vIsion-Language Model (SAIL) design, which jointly learn vision tokens and text tokens in a transformer. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements over the plain baseline. First, we design a learnable upsampling module to refine visual token features. Second, we propose a novel visual prompt injection strategy that enables the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Third, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), verified by manual checking. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.
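The abstract does not spell out how visual prompts are injected, so the following is only a rough sketch of what "early fusion of visual prompt embeddings and vision tokens" could look like, not the authors' implementation. It assumes the visual prompt is a binary mask, that vision tokens come from non-overlapping square patches in row-major order, and that fusion is a simple addition of one prompt embedding to every token whose patch is mostly covered by the mask. The names `patchify_mask`, `inject_visual_prompt`, `patch`, and `threshold` are hypothetical and chosen for illustration.

```python
import numpy as np


def patchify_mask(mask: np.ndarray, patch: int) -> np.ndarray:
    """Return the fraction of each patch covered by a binary (H, W) mask.

    Assumption: patches are non-overlapping squares in row-major order,
    matching the order of the vision tokens.
    """
    h, w = mask.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    gh, gw = h // patch, w // patch
    # Split into (gh, patch, gw, patch) blocks and average inside each block.
    blocks = mask.reshape(gh, patch, gw, patch).astype(np.float32)
    return blocks.mean(axis=(1, 3)).reshape(gh * gw)


def inject_visual_prompt(vision_tokens, mask, prompt_embedding, patch, threshold=0.5):
    """Add a prompt embedding to the vision tokens covered by the mask.

    vision_tokens:    (N, D) array, one token per image patch.
    mask:             (H, W) binary array marking the prompted region.
    prompt_embedding: (D,) vector (learnable in a real model; fixed here).
    Returns the fused tokens and the boolean selection of prompted tokens.
    """
    coverage = patchify_mask(mask, patch)
    assert coverage.shape[0] == vision_tokens.shape[0], "token/patch count mismatch"
    selected = coverage >= threshold      # hypothetical coverage rule
    fused = vision_tokens.copy()          # leave the input untouched
    fused[selected] += prompt_embedding   # early fusion by addition (assumed)
    return fused, selected


# Toy usage: a 4x4 image with 2x2 patches gives 4 tokens of width 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8)).astype(np.float32)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True                       # prompt covers the top-left patch
prompt = np.full(8, 0.1, dtype=np.float32)
fused, selected = inject_visual_prompt(tokens, mask, prompt, patch=2)
```

Because the prompt is fused before the first transformer layer in this sketch, every subsequent layer sees the prompted region alongside the text tokens, which is the intuition behind "early fusion"; the actual fusion operator and prompt encoding in Pixel-SAIL may differ.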

Paper Structure

This paper contains 15 sections, 1 equation, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison of current MLLMs for pixel-wise understanding with our method. (a) and (b): Current MLLMs for pixel-wise understanding feature highly complex system architectures, including an LLM, a CLIP-like vision backbone, an object token extraction model, a segmentation vision backbone, and a SAM-like decoder. (c): Our method employs only a single transformer.
  • Figure 2: The architecture of our proposed plain baseline and Pixel-SAIL. Pixel-SAIL is as simple and elegant as the plain baseline but demonstrates significantly improved performance. The examples on the right demonstrate that Pixel-SAIL possesses the capability for general conversation and comprehensive pixel-grounded understanding.
  • Figure 3: Visual examples from our PerBench. Best viewed in color and zoomed in.
  • Figure 4: Visualization results of Pixel-SAIL on diverse tasks. Best viewed in color and zoomed in. From top to bottom: visual prompt-based object captioning, single/multi-object referring segmentation, vision-text referring segmentation, image captioning and QA, and visual prompt-based conversation. Visual prompts in the form of points and boxes are converted into mask prompts using SAM (Kirillov et al., 2023). For more visualization results and comparisons with other MLLMs, please refer to the appendix.
  • Figure 5: Image feature visualization results. From left to right are the image feature of the base MLLM, the image feature of Pixel-SAIL, and the mask feature of Pixel-SAIL.
  • ...and 3 more figures