VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan

TL;DR

VideoGLaMM tackles fine-grained, pixel-level grounding in videos by integrating a frozen Large Language Model with dual vision encoders (spatial and temporal) and a spatio-temporal pixel decoder. Tunable Vision-to-Language and Language-to-Vision adapters align visual inputs with linguistic outputs end to end, so that the generated text is grounded by segmentation masks. A semi-automatic annotation pipeline yields a grounded dataset of roughly 38k video-QA triplets, 83k objects, and 671k masks. On Grounded Conversation Generation (GCG), visual grounding, and referring video segmentation, VideoGLaMM is reported to outperform existing approaches.

Abstract

Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semi-automatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
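
The data flow between the three components can be traced with a small shape-only sketch. It is illustrative rather than the authors' implementation: the `Tokens`, `project`, `concat`, and `trace_forward` names are hypothetical, every token count and feature width is a made-up placeholder, and the use of a few dedicated LLM states as mask prompts is an assumption. Only the ordering (V-L adapters, concatenation with text tokens, LLM, L-V adapter, pixel decoder) follows the description in the abstract and the Figure 2 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tokens:
    """Shape of a token sequence: number of tokens and feature width."""
    count: int
    dim: int


def project(x: Tokens, out_dim: int) -> Tokens:
    """Stand-in for a tunable adapter: changes width, keeps token count."""
    return Tokens(x.count, out_dim)


def concat(*parts: Tokens) -> Tokens:
    """Concatenate along the sequence axis; all parts must share one width."""
    dims = {p.dim for p in parts}
    if len(dims) != 1:
        raise ValueError(f"mismatched feature widths: {sorted(dims)}")
    return Tokens(sum(p.count for p in parts), dims.pop())


def trace_forward(spatial, temporal, text, llm_dim, decoder_dim, num_seg_tokens):
    """Follow tensor shapes through the described pipeline (no real compute)."""
    # V-L adapters map each visual stream into the LLM embedding space.
    spatial_llm = project(spatial, llm_dim)
    temporal_llm = project(temporal, llm_dim)
    # Visual tokens are concatenated with the text tokens to form the LLM input.
    llm_input = concat(spatial_llm, temporal_llm, text)
    # Assumption: the LLM yields one hidden state per grounded phrase.
    seg_states = Tokens(num_seg_tokens, llm_dim)
    # The L-V adapter maps those states into the pixel decoder's space.
    decoder_prompts = project(seg_states, decoder_dim)
    return {"llm_input": llm_input, "decoder_prompts": decoder_prompts}


shapes = trace_forward(
    spatial=Tokens(count=256, dim=1024),  # hypothetical image-encoder output
    temporal=Tokens(count=64, dim=768),   # hypothetical video-encoder output
    text=Tokens(count=32, dim=4096),      # text already embedded at LLM width
    llm_dim=4096,
    decoder_dim=256,
    num_seg_tokens=3,
)
print(shapes["llm_input"])
print(shapes["decoder_prompts"])
```

The sketch makes one design point visible: the two visual streams may have different widths, so each needs its own V-L adapter before concatenation, while the L-V adapter only has to handle the few LLM states that drive mask decoding.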

Paper Structure

This paper contains 19 sections, 6 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Grounded Conversation with VideoGLaMM. Our proposed multimodal video conversational model provides text responses grounded at the pixel level in the input video. The generated masks are spatio-temporally consistent across frames. The fine-grained grounded outputs from VideoGLaMM describe different levels of granularity, e.g., person, objects (bike), stuff (road), and explain object and scene attributes. Existing Video-LMMs do not offer pixel-level grounded conversational capability.
  • Figure 2: Working of VideoGLaMM. VideoGLaMM consists of a dual spatio-temporal encoder for encoding image-level and video-level features. The spatial features represent local information and the temporal features represent global information. The spatial and temporal tokens are passed through V-L adapters and concatenated with the text tokens before being fed to the LLM. An L-V projector aligns the LLM's response with the visual space of the pixel decoder. Finally, the aligned LLM features, along with the frame features from a frame encoder, are passed to a grounded pixel decoder to obtain the fine-grained object masks corresponding to the LLM response.
  • Figure 3: Proposed Semi-automatic Annotation Pipeline. Our dataset for grounded conversation generation (GCG) is built from three video dataset types: i) Videos having masks only: Object patches are extracted from video frames using masks and processed by the Gemini-Pro model for initial object descriptions, which are then refined to produce detailed object captions. These refined captions and masks are again fed to the Gemini-Pro model to create dense grounded captions. ii) Videos having bbox annotations and captions: Frames are first processed with a Video-LMM to generate a comprehensive caption, which is combined with the original caption and fed to GPT-4o to obtain dense grounded captions. Masks are generated from the frames and ground-truth bounding boxes with the SAM model. iii) Videos having object bboxes and referring expressions: Frames, bounding boxes, and referring expressions are input to GPT-4o for dense grounded captions, while masks are generated by feeding frames and bounding boxes to the SAM model.
  • Figure 4: Qualitative results of VideoGLaMM on grounded conversation generation (GCG). Given user queries, VideoGLaMM generates textual responses and grounds objects and phrases using pixel-level masks, showing its detailed understanding of the video.
  • Figure 5: Conditional Video Generation using VideoGLaMM.
  • ...and 2 more figures
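
The three branches of the annotation pipeline in the Figure 3 caption can be summarized as a small routing table. This is a sketch for orientation only: the `Branch` enum, the annotation keys (`"masks"`, `"bbox"`, `"captions"`, `"referring_expressions"`), and the precedence used when a video carries several annotation types are all assumptions; the step lists paraphrase the caption and call no real model.

```python
from enum import Enum
from typing import Iterable


class Branch(Enum):
    MASKS_ONLY = "masks only"
    BBOX_AND_CAPTIONS = "bounding boxes and captions"
    BBOX_AND_REFERRING = "bounding boxes and referring expressions"


def choose_branch(annotations: Iterable[str]) -> Branch:
    """Pick a branch from the labels a source video already has.

    The precedence below is an assumption, not stated in the caption.
    """
    have = set(annotations)
    if {"bbox", "referring_expressions"} <= have:
        return Branch.BBOX_AND_REFERRING
    if {"bbox", "captions"} <= have:
        return Branch.BBOX_AND_CAPTIONS
    if "masks" in have:
        return Branch.MASKS_ONLY
    raise ValueError(f"no pipeline branch for annotations: {sorted(have)}")


# Steps per branch, paraphrased from the Figure 3 caption.
STEPS = {
    Branch.MASKS_ONLY: [
        "extract object patches from frames using the masks",
        "Gemini-Pro: initial object descriptions",
        "refine descriptions into detailed object captions",
        "Gemini-Pro: dense grounded captions from refined captions and masks",
    ],
    Branch.BBOX_AND_CAPTIONS: [
        "Video-LMM: comprehensive caption from frames",
        "GPT-4o: dense grounded captions from generated and original captions",
        "SAM: masks from frames and ground-truth bounding boxes",
    ],
    Branch.BBOX_AND_REFERRING: [
        "GPT-4o: dense grounded captions from frames, boxes, and expressions",
        "SAM: masks from frames and bounding boxes",
    ],
}

branch = choose_branch(["bbox", "captions"])
for step in STEPS[branch]:
    print(f"[{branch.value}] {step}")
```

Read this way, the branches differ mainly in where the masks come from: the first reuses existing masks, while the other two derive masks from bounding boxes with SAM.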