ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022

Naiyuan Liu, Xiaohan Wang, Xiaobo Li, Yi Yang, Yueting Zhuang

TL;DR

The paper addresses temporal moment localization for natural language queries in long Ego4D videos, where training data is scarce and the videos are long. It proposes a multi-scale cross-modal transformer built on cross-attention, trained with a video frame-level contrastive loss and two training-time data augmentations: variable-length sliding window sampling (SW) and video splicing (VS). Together, these components yield state-of-the-art results, with the largest gains on the R@1 metrics, and ensembling at test time improves the results further.

Abstract

In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2022. Given a video clip and a text query, the goal of this challenge is to locate the temporal moment of the video clip where the answer to the query can be obtained. To tackle this task, we propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips. We also propose two data augmentation strategies to increase the diversity of training samples. The experimental results demonstrate the effectiveness of our method. The final submission ranked first on the leaderboard.

Paper Structure (8 sections, 4 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: The overall framework of our approach. (a) depicts our single-scale cross-modal transformer. (b) shows the details of our multi-scale cross-modal transformer.
  • Figure 2: Illustration of data augmentation. (a) shows how the variable-length sliding window sampling strategy (SW) works. (b) shows how the video splicing strategy (VS) works. (c) is a combination of these two data augmentations which leads to better performance.
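
The two augmentations named in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the window-length bounds (`min_len`, `max_len`), the uniform random placement of the window, and the random choice of splice order are all assumptions made for illustration. The only ideas taken from the text are that SW crops a variable-length window from the clip, that VS splices in footage from another video, and that the two can be combined.

```python
import random
from typing import List, Sequence, Tuple


def sliding_window_sample(
    features: Sequence, start: int, end: int,
    min_len: int, max_len: int, rng: random.Random,
) -> Tuple[Sequence, int, int]:
    """Hypothetical SW: crop a random-length window that fully contains
    the annotated moment [start, end), and re-index the moment to the crop."""
    n = len(features)
    lo = max(min_len, end - start)   # window must be able to hold the moment
    hi = min(max_len, n)             # window cannot exceed the clip
    if lo > hi:
        return features, start, end  # no valid window; leave the sample as is
    win_len = rng.randint(lo, hi)    # randint is inclusive on both ends
    # Window start w0 must keep the moment inside [w0, w0 + win_len).
    w0 = rng.randint(max(0, end - win_len), min(start, n - win_len))
    return features[w0:w0 + win_len], start - w0, end - w0


def video_splice(
    features: List, start: int, end: int,
    other: List, rng: random.Random,
) -> Tuple[List, int, int]:
    """Hypothetical VS: concatenate frames from another video before or
    after the clip, shifting the moment when frames are prepended."""
    if rng.random() < 0.5:
        return other + features, start + len(other), end + len(other)
    return features + other, start, end


# Combined use (assumed order: crop first, then splice).
rng = random.Random(0)
clip = list(range(100))              # stand-in for per-frame features
distractor = list(range(1000, 1030)) # stand-in for frames from another video
crop, s, e = sliding_window_sample(clip, 40, 55, 20, 60, rng)
aug, s, e = video_splice(crop, s, e, distractor, rng)
```

In this sketch the query text stays unchanged while the video and the moment boundaries are transformed together, so each augmented sample remains a valid (video, query, moment) triple.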