PiTe: Pixel-Temporal Alignment for Large Video-Language Model

Yang Liu; Pengxiang Ding; Siteng Huang; Min Zhang; Han Zhao; Donglin Wang

TL;DR

This paper proposes PiTe, a novel Large Video-Language Model (LVidLM) built on trajectory-guided Pixel-Temporal Alignment. PiTe exhibits promising applicability and beats state-of-the-art methods by a large margin on a wide range of video-related multi-modal tasks.

Abstract

Fueled by the wave of Large Language Models (LLMs), Large Visual-Language Models (LVLMs) have emerged as a pivotal advancement, bridging the gap between image and text. However, video remains challenging for LVLMs because of the complex relationship between language and spatial-temporal data. Recent Large Video-Language Models (LVidLMs) align features of static visual data, such as images, with the latent space of language features through general multi-modal tasks in order to fully leverage the abilities of LLMs. In this paper, we explore a fine-grained alignment approach that uses object trajectories to align the modalities across the spatial and temporal dimensions simultaneously. We therefore propose PiTe, a novel LVidLM based on trajectory-guided Pixel-Temporal Alignment, that exhibits promising applicability. To achieve fine-grained video-language alignment, we curate PiTe-143k, a multi-modal pre-training dataset built by our automatic annotation pipeline. It provides pixel-level moving trajectories for every individual object that both appears in the video and is mentioned in the caption. PiTe demonstrates strong capabilities on a wide range of video-related multi-modal tasks, beating state-of-the-art methods by a large margin.

Paper Structure (30 sections, 9 equations, 5 figures, 5 tables)

This paper contains 30 sections, 9 equations, 5 figures, 5 tables.

Introduction
Related Work
Large Language Models
Large Visual-Language Models
Large Video-Language Models
PiTe-143k Dataset
Referring Expression Segmentation
Point Tracking
PiTe
Architecture
Vision Encoder.
Visual Adapter.
Large Language Model.
Training Strategy
Stage 1: Referring Expression Localization.
...and 15 more sections

Figures (5)

Figure 1: Comparison with existing LVidLMs in terms of alignment paradigm and performance. In the radar chart, QA, TG, and DC denote question answering, temporal grounding, and dense captioning, respectively.
Figure 2: Automatic annotation pipeline for PiTe-143k. The video sample in the figure showcases two events, positioned at the beginning and end of the video. The procedure for extracting noun phrases with SuPar [conf/acl/ZhangLZ20, conf/ijcai/ZhangZL20] is illustrated in Figure 3.
Figure 3: Two samples of constituency parser for Noun Phrase (NP) extraction.
Figure 4: Schematic of PiTe framework for video-language alignment.
Figure 5: PiTe's video understanding capabilities and performance comparison across varying tracking point quantities.
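
The noun-phrase extraction step named in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. It assumes the constituency parser has already produced a Penn-Treebank-style bracketed tree as a string. The helper names (`parse_bracketed`, `leaves`, `base_noun_phrases`) and the choice to keep only the innermost NPs are assumptions made for illustration, not the authors' pipeline or the SuPar API.

```python
from typing import List, Tuple, Union

# A tree node is (label, children); each child is another node or a word.
Node = Tuple[str, list]


def parse_bracketed(text: str) -> Node:
    """Parse a bracketed constituency tree such as '(NP (DT a) (NN dog))'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children: List[Union[Node, str]] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(node: Node) -> List[str]:
    """Return the words covered by a node, in sentence order."""
    words: List[str] = []
    for child in node[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def contains_np(node: Node) -> bool:
    """True if any proper descendant of the node is labelled NP."""
    return any(
        not isinstance(c, str) and (c[0] == "NP" or contains_np(c))
        for c in node[1]
    )


def base_noun_phrases(node: Node) -> List[str]:
    """Collect innermost NPs (assumption: these name individual objects)."""
    if node[0] == "NP" and not contains_np(node):
        return [" ".join(leaves(node))]
    found: List[str] = []
    for child in node[1]:
        if not isinstance(child, str):
            found.extend(base_noun_phrases(child))
    return found


if __name__ == "__main__":
    tree = parse_bracketed(
        "(S (NP (DT A) (NN man)) (VP (VBZ throws) "
        "(NP (NP (DT a) (NN ball)) (PP (IN to) (NP (DT a) (NN dog))))))"
    )
    print(base_noun_phrases(tree))
```

In a pipeline of the kind Figure 2 describes, each extracted phrase would then serve as the referring expression for the segmentation stage. Keeping only innermost NPs is one possible filter; a real pipeline might keep larger phrases or discard pronouns.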
