Table of Contents
Fetching ...

PerceptionGPT: Effectively Fusing Visual Perception into LLM

Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, Tong Zhang

TL;DR

PerceptionGPT tackles the challenge of endowing vision-language models with robust visual perception by embedding perception signals directly into the LLM's token space using a single <vis> marker and lightweight encoders/decoders. This approach reduces training data and parameter requirements, eliminates discretization errors, and shortens decoding sequences, achieving state-of-the-art results on referring expression tasks with parameter-efficient fine-tuning. The method combines autoregressive language modeling with vision-specific losses (box and mask) and employs adaptive multi-layer feature fusion to leverage information across ViT layers. Experiments across REC, RES, captioning, and VQA demonstrate strong performance and efficiency gains, highlighting the practical potential of end-to-end P-VLMs that exploit LLM embeddings for visual perception. Overall, PerceptionGPT offers a scalable, data-efficient path toward flexible, high-quality visual understanding in LLM-based systems.

Abstract

The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs). However, effectively harnessing VLLMs for intricate visual perception tasks remains a challenge. In this paper, we present a novel end-to-end framework named PerceptionGPT, which efficiently and effectively equips the VLLMs with visual perception abilities by leveraging the representation power of LLMs' token embedding. Our proposed method treats the token embedding of the LLM as the carrier of spatial information, then leverage lightweight visual task encoders and decoders to perform visual perception tasks (e.g., detection, segmentation). Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens, and enables achieving superior performance with fewer trainable parameters, less training data and shorted training time. Moreover, as only one token embedding is required to decode the visual outputs, the resulting sequence length during inference is significantly reduced. Consequently, our approach enables accurate and flexible representations, seamless integration of visual perception tasks, and efficient handling of a multiple of visual outputs. We validate the effectiveness and efficiency of our approach through extensive experiments. The results demonstrate significant improvements over previous methods with much fewer trainable parameters and GPU hours, which facilitates future research in enabling LLMs with visual perception abilities.

PerceptionGPT: Effectively Fusing Visual Perception into LLM

TL;DR

PerceptionGPT tackles the challenge of endowing vision-language models with robust visual perception by embedding perception signals directly into the LLM's token space using a single <vis> marker and lightweight encoders/decoders. This approach reduces training data and parameter requirements, eliminates discretization errors, and shortens decoding sequences, achieving state-of-the-art results on referring expression tasks with parameter-efficient fine-tuning. The method combines autoregressive language modeling with vision-specific losses (box and mask) and employs adaptive multi-layer feature fusion to leverage information across ViT layers. Experiments across REC, RES, captioning, and VQA demonstrate strong performance and efficiency gains, highlighting the practical potential of end-to-end P-VLMs that exploit LLM embeddings for visual perception. Overall, PerceptionGPT offers a scalable, data-efficient path toward flexible, high-quality visual understanding in LLM-based systems.

Abstract

The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs). However, effectively harnessing VLLMs for intricate visual perception tasks remains a challenge. In this paper, we present a novel end-to-end framework named PerceptionGPT, which efficiently and effectively equips the VLLMs with visual perception abilities by leveraging the representation power of LLMs' token embedding. Our proposed method treats the token embedding of the LLM as the carrier of spatial information, then leverage lightweight visual task encoders and decoders to perform visual perception tasks (e.g., detection, segmentation). Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens, and enables achieving superior performance with fewer trainable parameters, less training data and shorted training time. Moreover, as only one token embedding is required to decode the visual outputs, the resulting sequence length during inference is significantly reduced. Consequently, our approach enables accurate and flexible representations, seamless integration of visual perception tasks, and efficient handling of a multiple of visual outputs. We validate the effectiveness and efficiency of our approach through extensive experiments. The results demonstrate significant improvements over previous methods with much fewer trainable parameters and GPU hours, which facilitates future research in enabling LLMs with visual perception abilities.
Paper Structure (24 sections, 6 equations, 4 figures, 7 tables)

This paper contains 24 sections, 6 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Illustration of different strategies to encode and decode visual perception information. Previous approaches formulate the visual information into discrete tokens in the same way as text. On the other hand, our PerceptionGPT leverages lightweight visual encoder and decoders to fuse such information into the embedding space of LLM.
  • Figure 2: The illustration of framework of PerceptionGPT. The model is trained in an end-to-end manner, rather than outputting the location and coordinates in the form of discrete tokens, each box and mask can be represented by one single continuous embedding from the LLM's output. Both the perception encoder and decoder are lightweight architectures (e.g., MLP, ResNet) trained from scratch, without relying on any pretrained visual experts.
  • Figure 3: Visualization of results from PerceptionGPT. Our proposed framework enables effectively fusing visual perception capability into P-VLM while maintaining its generation and reasoning ability. Row [1-4], row [5-8] demonstrate spot captioning and reasoning segmentation/detection, respectively. row [10] demonstrates image-level captioning and region-level captioning.
  • Figure 4: Left: The performance of different strategy for fusing visual features on various tasks. Right: The magnitude of learnt adaptive weights for visual features across different VIT layers.