Contrastive Language Video Time Pre-training

Hengyue Liu, Kyle Min, Hector A. Valdez, Subarna Tripathi

TL;DR

The paper tackles the challenge of learning aligned language, video, and temporal representations from long-form egocentric videos under tight memory and compute constraints. It introduces LAVITI, a DETR-style, moment-query-based pre-training framework that uses a frozen CLIP backbone for visual and language encodings and learns relative temporal embeddings to represent timestamps. A three-way Hungarian matching with a sigmoid contrastive loss aligns the visual, language, and temporal embeddings of detected moments, enabling zero-shot natural language query and strong action recognition on CharadesEgo. The method trains memory-efficiently on Ego4D and scales to thousands of frames, suggesting practical impact for episodic memory and long-range video understanding.
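
To make the sigmoid contrastive loss mentioned above concrete, here is a minimal NumPy sketch of a pairwise sigmoid loss in the style of SigLIP. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `temperature` and `bias` values, and the assumption that row i of `pred` is already matched to row i of `target` (for example, after Hungarian matching) are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def sigmoid_contrastive_loss(pred, target, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all (pred_i, target_j) pairs.

    Assumption: row i of `pred` is matched to row i of `target`, so the
    diagonal pairs are positives and every other pair is a negative.
    `temperature` and `bias` are illustrative values, not the paper's.
    """
    logits = temperature * l2_normalize(pred) @ l2_normalize(target).T + bias
    labels = 2.0 * np.eye(len(pred)) - 1.0  # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably with logaddexp
    return np.logaddexp(0.0, -labels * logits).sum() / len(pred)

rng = np.random.default_rng(0)
moments = rng.normal(size=(4, 16))             # stand-in moment embeddings
aligned = sigmoid_contrastive_loss(moments, moments)
shuffled = sigmoid_contrastive_loss(moments, np.roll(moments, 1, axis=0))
```

With identical inputs the matched pairs sit on the diagonal and the loss is small; rolling the targets by one row breaks the pairing and the loss rises. Unlike a softmax-based loss, each pair is scored independently, so no normalization across the batch is needed.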

Abstract

We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning. Unlike methods such as EgoVLP that pre-train on video-text pairs, LAVITI aims to align language, video, and temporal features by extracting meaningful moments in untrimmed videos. Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features. In addition to vision and language alignment, we introduce relative temporal embeddings (TE) to represent timestamps in videos, which enables contrastive learning of time. In a significant departure from traditional approaches, predicting a particular timestamp is recast as computing the similarity score between the predicted TE and all TEs. Furthermore, existing approaches for video understanding are mainly designed for short videos due to their high computational complexity and memory footprint. Our method can be trained on the Ego4D dataset with only 8 NVIDIA RTX-3090 GPUs in a day. We validate our method on CharadesEgo action recognition, achieving state-of-the-art results.
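
The idea of predicting a timestamp by scoring a predicted TE against all TEs can be sketched as a nearest-embedding lookup. The snippet below is only an illustration: the table of TEs is random rather than learned, cosine similarity is an assumed choice of score, and `predict_timestamp`, the table size, and the embedding width are hypothetical names and values.

```python
import numpy as np

def predict_timestamp(pred_te, te_table):
    """Return the index of the TE most similar to the prediction, plus all scores."""
    pred = pred_te / np.linalg.norm(pred_te)
    table = te_table / np.linalg.norm(te_table, axis=1, keepdims=True)
    scores = table @ pred  # cosine similarity to the TE of every time slot
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
te_table = rng.normal(size=(32, 64))  # 32 time slots, 64-dim (stand-in for learned TEs)
noisy_pred = te_table[12] + 0.1 * rng.normal(size=64)  # prediction near slot 12
index, scores = predict_timestamp(noisy_pred, te_table)
```

Framing the timestamp as a similarity search rather than a regressed scalar is what lets time take part in the same contrastive objective as the visual and language embeddings.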

Paper Structure

This paper contains 12 sections, 1 equation, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: The architecture and training pipeline of LAVITI. We use a set of learnable queries to capture both visual and temporal features and directly predict the visual (V) and temporal (T) embeddings of potential moments. Predicted visual embeddings are aligned with ground-truth narration text embeddings (L), and predicted TEs are aligned with TEs interpolated at ground-truth timestamps.
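
The caption's "interpolated TE at ground-truth timestamps" could be computed as in the sketch below. The caption does not specify the interpolation scheme, so linear interpolation between the two neighboring embeddings is an assumption here, as are the function name `interpolate_te` and the convention that a timestamp is expressed as a relative position in [0, 1].

```python
import numpy as np

def interpolate_te(te_table, t):
    """Linearly interpolate between neighboring TEs at relative time t in [0, 1].

    Assumption: `te_table` holds one embedding per evenly spaced time slot,
    with slot 0 at the start of the video and the last slot at the end.
    """
    pos = float(np.clip(t, 0.0, 1.0)) * (len(te_table) - 1)
    lo = int(np.floor(pos))
    hi = min(lo + 1, len(te_table) - 1)
    frac = pos - lo
    return (1.0 - frac) * te_table[lo] + frac * te_table[hi]

te_table = np.arange(10, dtype=float).reshape(5, 2)  # 5 slots, 2-dim toy embeddings
target_te = interpolate_te(te_table, 0.125)          # halfway between slots 0 and 1
```

Interpolating means a ground-truth timestamp that falls between two slots still gets a well-defined target embedding, so the number of learned TEs does not have to match the video's frame count.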