HyperSeg: Towards Universal Visual Segmentation with Large Language Model

Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, Yujiu Yang

TL;DR

HyperSeg addresses universal pixel-level segmentation for both images and videos by integrating a Fine-grained Visual Perceiver, Hybrid Entity Recognition, and a Temporal Adapter into a lightweight VLLM framework with dual prompts and multi-task training. The approach enables fine-grained visual understanding and long-range temporal reasoning, achieving strong results on diverse segmentation benchmarks, including complex reasoning tasks. Key contributions include enabling VLLMs to perform multi-task, cross-domain segmentation with improved detail capture and temporal coherence, and demonstrating superior performance over prior VLLM-based methods. This work advances universal segmentation by showing how vision-language models can be extended to fine-grained, temporally aware perception tasks across modalities.

Abstract

This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, limitations in adaptation to both image and video scenarios, as well as the complex reasoning segmentation, make it difficult for them to handle various challenging instructions and achieve an accurate understanding of fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks requiring powerful reasoning abilities and world knowledge. Besides, to fully leverage the recognition capabilities of VLLMs and the fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks. Combined with the temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.

Paper Structure

This paper contains 26 sections, 8 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Illustration of our HyperSeg which can conduct image and video segmentation tasks with various language and visual instructions. Additionally, HyperSeg can handle complicated reasoning perception tasks compared with previous universal segmentation methods. To our knowledge, HyperSeg is the first VLLM-based universal segmentation model with perception and complex reasoning abilities in both image and video domains.
  • Figure 2: Overview of HyperSeg. HyperSeg encodes the visual input in a multi-grained manner and concatenates the prompt for different perception tasks. We feed learnable fine-grained tokens into a Fine-grained Visual Perceiver (FVP) to integrate multi-scale high-resolution image features into LLM for detailed visual learning and to facilitate space-time information propagation for video understanding. Additionally, we use the semantically enhanced mask tokens and prompt embedding to finally generate the segmentation masks and class scores for generic segmentation, and instance embedding for video instance association.
  • Figure 3: The comparison of different recognition strategies. (a) Generation-Only [Lai2023LISARS, Ren2023PixelLMPR]: both the semantic recognition (existing objects) and their mask tokens are generated by LLM. (b) Decode-Only [zhang2024psalm, zhang2024omg]: prompt embedding and mask tokens are decoded from LLM. The present objects are then determined by their similarity scores. (c) Hybrid (ours): prompt embedding is decoded from LLM while the semantically enhanced mask tokens are generated by LLM. Their similarity scores reflect the objects' presence.
  • Figure 4: Comparison between previous vision perceiver and our FVP. (a): previous vision perceiver [li2023blip, bai2023qwen] uses the coarse single-scale CLIP visual features which are inadequate for fine-grained perception tasks. (b): FVP encodes the multi-scale visual features into fine-grained tokens.
  • Figure 5: Qualitative results of HyperSeg’s capability in referring expression segmentation.
  • ...and 6 more figures
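
The Figure 3 caption says that similarity scores between mask tokens and prompt embeddings reflect which objects are present. The toy sketch below illustrates that idea only; it is not the authors' implementation. The use of cosine similarity, the 0.8 threshold, the function names, and the three-dimensional vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def class_scores(mask_tokens, prompt_embeddings):
    """One row per mask token, one column per class prompt embedding."""
    return [[cosine(m, p) for p in prompt_embeddings] for m in mask_tokens]

def present_classes(scores, class_names, threshold=0.8):
    """Names of classes that at least one mask token matches confidently.

    The threshold value is hypothetical, chosen only for this toy example.
    """
    present = set()
    for row in scores:
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            present.add(class_names[best])
    return sorted(present)

# Toy data: three class prompts and three mask tokens in a 3-d space.
class_names = ["cat", "dog", "car"]
prompt_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask_tokens = [
    [0.9, 0.1, 0.0],    # close to the "cat" prompt
    [0.1, 0.1, 0.1],    # ambiguous, below the threshold for every class
    [0.0, 0.05, 0.95],  # close to the "car" prompt
]

scores = class_scores(mask_tokens, prompt_embeddings)
print(present_classes(scores, class_names))
```

In this sketch a class counts as present when some mask token's best match clears the threshold, so the ambiguous second token contributes nothing and "dog" is never reported.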