$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Ye Liu; Jixuan He; Wanhua Li; Junsik Kim; Donglai Wei; Hanspeter Pfister; Chang Wen Chen

TL;DR

This work tackles video temporal grounding (VTG) by treating CLIP as a multi-layer spatial-temporal feature source and introducing Reversed Recurrent Tuning ($R^2$-Tuning), a memory- and parameter-efficient transfer-learning framework. A lightweight $R^2$ Block is attached to the tail of the CLIP encoder and, working from the last layer toward earlier ones, performs query-modulated spatial pooling followed by recurrent temporal refinement. The resulting spatial-temporal representation feeds the moment retrieval (MR), highlight detection (HD), and video summarization (VS) heads. To calibrate granularity across CLIP layers, the approach employs video-level and layer-wise contrastive losses, so that coarse-to-fine queries can be handled without retraining the full backbone. Experiments on six VTG benchmarks show state-of-the-art performance without additional temporal backbones, which highlights the practicality of efficient image-to-video transfer learning for untrimmed videos and suggests avenues for future work, including incorporating audio modalities.

Abstract

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight $R^2$ Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, $R^2$ Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioning on the given query, resulting in a coarse-to-fine scheme. $R^2$-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.
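
The coarse-to-fine loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one pooled query vector per layer, uses plain dot-product attention, and omits all learnable weights. The names `query_modulated_pool`, `temporal_refine`, and `r2_sketch` are hypothetical and chosen only for this example.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a single vector."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def query_modulated_pool(frame_patches, query):
    """Pool one frame's patch tokens, weighted by their similarity to the query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(p, query) / scale for p in frame_patches])
    return weighted_sum(weights, frame_patches)

def temporal_refine(h, pooled):
    """Add the current layer's pooled features to h, then mix frames via self-attention."""
    h = [[a + b for a, b in zip(hv, pv)] for hv, pv in zip(h, pooled)]
    scale = math.sqrt(len(h[0]))
    out = []
    for hv in h:
        attn = softmax([dot(hv, other) / scale for other in h])
        mixed = weighted_sum(attn, h)
        out.append([a + b for a, b in zip(hv, mixed)])
    return out

def r2_sketch(layer_patches, layer_queries):
    """Visit layers from last to first, reusing the same block at every step."""
    num_frames = len(layer_patches[0])
    dim = len(layer_queries[0])
    h = [[0.0] * dim for _ in range(num_frames)]  # running spatial-temporal state
    for patches, query in zip(reversed(layer_patches), reversed(layer_queries)):
        pooled = [query_modulated_pool(frame, query) for frame in patches]
        h = temporal_refine(h, pooled)
    return h

# Random stand-ins for frozen CLIP features: K layers, T frames, P patches, D dims.
random.seed(0)
K, T, P, D = 3, 4, 5, 6
layer_patches = [[[[random.gauss(0, 1) for _ in range(D)] for _ in range(P)]
                  for _ in range(T)] for _ in range(K)]
layer_queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
h = r2_sketch(layer_patches, layer_queries)
print(len(h), len(h[0]))  # 4 6
```

The point of the sketch is the control flow: a single block is applied repeatedly while walking the layers in reverse, so the state `h` starts from the last layer's features and is progressively refined with earlier-layer detail.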

Paper Structure (47 sections, 14 equations, 11 figures, 12 tables)

Introduction
Related Work
CLIP for Video Understanding
Video Temporal Grounding
Parameter-Efficient Transfer Learning
Methodology
Problem Formulation
Overview
Reversed Recurrent Tuning
Query-Modulated Spatial Pooling
Recurrent Temporal Refinement
Granularity Calibration
Prediction Heads
Foreground-Background Classification
Boundary Regression
...and 32 more sections

Figures (11)

Figure 1: Video temporal grounding (VTG) contains three video-language understanding problems, i.e., moment retrieval (MR), highlight detection (HD), and video summarization (VS).
Figure 2: Moment retrieval mAP with different backbones on the QVHighlights val split. CLIP's potential for temporal modeling has not been fully exploited.
Figure 3: Different architectural designs for CLIP-based image-to-video transfer learning. The gray rectangle in (d) denotes the progressively refined spatial-temporal features.
Figure 4: Overall architecture of our framework. The input video and query are first encoded by frozen CLIP (Radford et al., 2021) encoders. Their multi-layer outputs are then recurrently fused and refined by a learnable $R^2$ Block to construct spatial-temporal representations $h$, which are scaled up/down to construct a temporal feature pyramid, followed by three heads for MR, HD, and VS, respectively.
Figure 5: Detailed architecture of the $R^2$ Block. It can be split into two parts: a) query-modulated spatial pooling, and b) recurrent temporal refinement. Note that the [CLS] tokens of visual features are omitted for clarity.
...and 6 more figures
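
The temporal feature pyramid mentioned in the Figure 4 caption can be pictured with a small sketch. This is an illustration under assumptions, not the paper's design: it covers only the downscaling direction, halves the temporal length by averaging adjacent frames, and the function name `build_temporal_pyramid` and the `num_levels` parameter are hypothetical.

```python
def build_temporal_pyramid(features, num_levels=4):
    """Return a list of levels; each level halves the temporal length of the previous one.

    features: list of T frame vectors (lists of floats).
    Downscaling here averages adjacent frame pairs (an assumption for illustration).
    """
    pyramid = [features]
    current = features
    for _ in range(num_levels - 1):
        if len(current) < 2:
            break  # cannot halve a single frame any further
        pooled = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            pooled.append([(x + y) / 2 for x, y in zip(a, b)])
        current = pooled
        pyramid.append(current)
    return pyramid

# Eight frames with two feature dimensions each.
feats = [[float(t), float(t) * 2] for t in range(8)]
pyr = build_temporal_pyramid(feats, num_levels=4)
print([len(level) for level in pyr])  # [8, 4, 2, 1]
```

Coarser levels cover longer time spans per position, which is why a pyramid of this kind is commonly used to detect moments of different durations.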
