Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang

TL;DR

Sa2VA presents a unified framework that merges SAM-2 with a vision-language LLM to achieve dense, grounded understanding for both images and videos. By adopting a one-shot instruction-tuning approach and a decoupled architecture with a learnable [SEG] prompt, it enables tasks ranging from referring segmentation to visual question answering and grounded captioning. The paper introduces Ref-SAV, a large auto-labeled video grounding dataset, and demonstrates strong performance across image/video grounding benchmarks, video VQA, and conversational tasks, while remaining extensible to other VLMs. Its open-source release and detailed ablations position Sa2VA as a versatile baseline for pixel-level multimodal systems and dense grounding in real-world scenarios.

Abstract

This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with an MLLM, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended to various VLMs, including Qwen-VL and Intern-VL, so it can be updated to keep pace with the rapid progress of open-source VLMs. Code and models have been provided to the community.

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, 20 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Illustration of the capabilities of our proposed Sa2VA. (a) Given a video, Sa2VA can segment the referred object and understand the whole scene. (b) Sa2VA supports image chat, video chat, image referring segmentation, video referring segmentation, and grounded caption generation with single-shot instruction-tuning. (c) Sa2VA achieves strong results on multiple image and video referring segmentation and chat benchmarks compared to existing MLLMs [hanoona2023GLaMM, OMGLLaVA, yan2024visa].
  • Figure 2: Proposed Sa2VA model. The model first encodes the input texts, visual prompts, images, and videos into token embeddings. These tokens are then processed through a large language model (LLM). The output text tokens are used to generate the "[SEG]" token and associated language outputs. The SAM-2 decoder receives the image and video features from the SAM-2 encoder, along with the "[SEG]" token, to generate corresponding image and video masks. Modules with a fire icon are trained during the one-shot instruction-tuning.
  • Figure 3: Data annotation pipeline. Our proposed automatic data annotation pipeline consists of three stages: object-level, scene-level, and video-level text expression annotation.
  • Figure 4: Visualization results on GCG tasks. Top: our method. Bottom: OMG-LLaVA [OMGLLaVA]. Note that our method has stronger, more fine-grained grounding ability and text alignment than OMG-LLaVA [OMGLLaVA], a previous strong baseline.
  • Figure 5: Visualization results on image referring segmentation task.
  • ...and 3 more figures
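
The data flow described in the Figure 2 caption (the LLM emits a "[SEG]" token, which is handed to the SAM-2 decoder together with the visual features) can be illustrated with a small toy sketch. This is not the authors' implementation. The token id `SEG_TOKEN_ID`, the helper names, the use of the LLM hidden state at the "[SEG]" position, and the bias-free linear projection into the decoder's prompt space are all assumptions made for illustration; the source only states that the "[SEG]" token is passed to the SAM-2 decoder.

```python
from typing import List, Sequence

SEG_TOKEN_ID = 32000  # hypothetical vocabulary id for the "[SEG]" token


def find_seg_positions(token_ids: Sequence[int], seg_id: int = SEG_TOKEN_ID) -> List[int]:
    """Return the indices at which the LLM emitted the "[SEG]" token."""
    return [i for i, tok in enumerate(token_ids) if tok == seg_id]


def project(hidden: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Apply a bias-free linear map: one output value per row of `weight`."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def seg_prompts(
    token_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    weight: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Map each "[SEG]" hidden state to a prompt vector for a mask decoder.

    Assumption: the hidden state at the "[SEG]" position is what gets
    projected and used as the decoder prompt.
    """
    return [project(hidden_states[i], weight) for i in find_seg_positions(token_ids)]


# Toy example: 4 output tokens, LLM hidden size 3, decoder prompt size 2.
token_ids = [11, 42, SEG_TOKEN_ID, 2]
hidden_states = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 3.0],  # hidden state at the "[SEG]" position
    [0.5, 0.5, 0.5],
]
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]
prompts = seg_prompts(token_ids, hidden_states, weight)
print(prompts)  # [[1.0, 5.0]]
```

In a real system the resulting prompt vector would be consumed by the mask decoder alongside the encoder features for each image or video frame; that step is omitted here because it depends on the SAM-2 interface, which the source does not detail.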