$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, Chang Wen Chen

TL;DR

This work tackles video temporal grounding (VTG) by treating CLIP as a multi-layer spatial-temporal feature source and introducing Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework. A lightweight $R^2$ Block is attached to the tail of the CLIP encoder to perform query-modulated spatial pooling followed by recurrent temporal refinement, producing a spatial-temporal representation used by the moment retrieval (MR), highlight detection (HD), and video summarization (VS) heads. To calibrate granularity across CLIP layers, the approach employs video-level and layer-wise contrastive losses, which lets it handle coarse-to-fine queries without retraining the full backbone. Experiments on six VTG benchmarks show state-of-the-art performance without additional temporal backbones, highlighting the practicality of efficient image-to-video transfer learning for untrimmed videos. Incorporating audio modalities is suggested as future work.

Abstract

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight $R^2$ Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, $R^2$ Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioning on the given query, resulting in a coarse-to-fine scheme. $R^2$-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.
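
The reversed, recurrent traversal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes pre-extracted per-layer patch features and one pooled query vector per layer, replaces the learnable $R^2$ Block with parameter-free attention, and fuses layers by plain addition. All function names below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Sum of equal-length vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def query_modulated_pooling(frame_patches, query):
    """Pool one frame's patch vectors into one vector, weighted by query similarity."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(p, query) / scale for p in frame_patches])
    return weighted_sum(weights, frame_patches)


def temporal_refinement(hidden):
    """Parameter-free self-attention across frames (assumed stand-in for the learned module)."""
    scale = math.sqrt(len(hidden[0]))
    refined = []
    for h_t in hidden:
        weights = softmax([dot(h_t, h_s) / scale for h_s in hidden])
        refined.append(weighted_sum(weights, hidden))
    return refined


def reversed_recurrent_pass(layer_feats, layer_queries):
    """layer_feats[k][t][p] is a D-dim patch vector from layer k; layer_queries[k] is D-dim."""
    num_frames = len(layer_feats[0])
    dim = len(layer_queries[0])
    hidden = [[0.0] * dim for _ in range(num_frames)]
    # Visit layers from the last one back to the first (coarse to fine).
    for k in reversed(range(len(layer_feats))):
        pooled = [query_modulated_pooling(frame, layer_queries[k])
                  for frame in layer_feats[k]]
        # Simplified fusion: add the pooled features to the running state.
        hidden = [[h + p for h, p in zip(h_t, p_t)]
                  for h_t, p_t in zip(hidden, pooled)]
        hidden = temporal_refinement(hidden)
    return hidden


# Toy input: 2 layers, 3 frames, 2 patches per frame, 2-dim features.
toy_feats = [[[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)] for _ in range(2)]
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_hidden = reversed_recurrent_pass(toy_feats, toy_queries)
```

The result `toy_hidden` holds one feature vector per frame. In the paper's framework, such representations are further built into a temporal feature pyramid and passed to the task heads, which this sketch omits.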

Paper Structure (47 sections, 14 equations, 11 figures, 12 tables)

Figures (11)

  • Figure 1: Video temporal grounding (VTG) contains three video-language understanding problems, i.e., moment retrieval (MR), highlight detection (HD), and video summarization (VS).
  • Figure 2: Moment retrieval mAP with different backbones on the QVHighlights val split. CLIP's potential for temporal modeling was not fully exploited.
  • Figure 3: Different architectural designs for CLIP-based image-to-video transfer learning. The gray rectangle in (d) denotes the progressively refined spatial-temporal features.
  • Figure 4: Overall architecture of our framework. The input video and query are first encoded by frozen CLIP (Radford et al., 2021) encoders. Their multi-layer outputs are then recurrently fused and refined by a learnable $\rm R^2$ Block to construct spatial-temporal representations $h$, which are scaled up/down to construct a temporal feature pyramid, followed by three heads for MR, HD, and VS, respectively.
  • Figure 5: Detailed architecture of the $\rm R^2$ Block. It can be split into two parts: a) query-modulated spatial pooling, and b) recurrent temporal refinement. Note that the [CLS] tokens of visual features are omitted for clarity.
  • ...and 6 more figures