VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan
TL;DR
VideoGLaMM tackles fine-grained, pixel-level grounding in videos by connecting a Large Language Model to dual vision encoders (spatial and temporal) and a spatio-temporal pixel decoder. Tunable Vision-to-Language (V-L) and Language-to-Vision (L-V) adapters align visual inputs with language outputs, so the generated text is grounded by segmentation masks. A semi-automatic annotation pipeline yields a grounded dataset of about 38k video-QA triplets, 83k objects, and 671k masks. On Grounded Conversation Generation (GCG), visual grounding, and referring video segmentation, VideoGLaMM outperforms existing approaches.
Abstract
Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semi-automatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
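To make the data flow in the abstract concrete, here is a minimal shape-level sketch. It only traces tensor shapes through the described components (dual vision encoder, V-L adapters, LLM, L-V adapter, spatio-temporal decoder); it is not the authors' implementation. Every name and size below (`Dims`, `project`, `trace_shapes`, the token counts and widths) is hypothetical. The layout of the temporal tokens and the idea that the LLM yields one hidden state per grounded phrase are assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dims:
    """Illustrative sizes only; none of these values come from the paper."""
    frames: int = 8      # sampled video frames
    patches: int = 16    # spatial tokens per frame
    d_vis: int = 32      # vision encoder feature width
    d_llm: int = 64      # LLM hidden width
    d_dec: int = 24      # pixel decoder prompt width
    height: int = 28     # output mask height
    width: int = 28      # output mask width


def project(shape, d_out):
    """Shape effect of a linear adapter: (n, d_in) -> (n, d_out)."""
    return shape[:-1] + (d_out,)


def trace_shapes(dims, n_text_tokens, n_grounded_phrases):
    """Return the shape of each intermediate result in the assumed pipeline."""
    # Dual vision encoder: per-frame spatial tokens, plus one temporal
    # token per frame (assumed layout).
    spatial = (dims.frames * dims.patches, dims.d_vis)
    temporal = (dims.frames, dims.d_vis)

    # V-L adapters map both visual streams to the LLM width.
    spatial_llm = project(spatial, dims.d_llm)
    temporal_llm = project(temporal, dims.d_llm)

    # LLM input: visual tokens followed by the text instruction tokens.
    llm_input = (spatial_llm[0] + temporal_llm[0] + n_text_tokens, dims.d_llm)

    # Assumption: the LLM yields one hidden state per grounded phrase.
    phrase_states = (n_grounded_phrases, dims.d_llm)

    # L-V adapter maps those states to prompts for the pixel decoder.
    decoder_prompts = project(phrase_states, dims.d_dec)

    # Spatio-temporal decoder: one mask per grounded phrase per frame.
    masks = (n_grounded_phrases, dims.frames, dims.height, dims.width)

    return {
        "llm_input": llm_input,
        "decoder_prompts": decoder_prompts,
        "masks": masks,
    }


shapes = trace_shapes(Dims(), n_text_tokens=10, n_grounded_phrases=2)
for name, shape in shapes.items():
    print(name, shape)
```

The sketch keeps the two adapters as separate projection steps because the abstract describes them as distinct tunable modules: one carries visual features into the language space, the other carries language-side states back to the mask decoder.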
