SimBase: A Simple Baseline for Temporal Video Grounding

Peijun Bao, Alex C. Kot

TL;DR

This paper designs SimBase, a network that leverages lightweight, one-dimensional temporal convolutional layers instead of complex temporal structures, and achieves state-of-the-art results on two large-scale datasets.

Abstract

This paper presents SimBase, a simple yet effective baseline for temporal video grounding. While recent advances in temporal grounding have led to impressive performance, they have also driven network architectures toward greater complexity, with a range of methods to (1) capture temporal relationships and (2) achieve effective multimodal fusion. In contrast, this paper explores the question: How effective can a simplified approach be? To investigate, we design SimBase, a network that leverages lightweight, one-dimensional temporal convolutional layers instead of complex temporal structures. For cross-modal interaction, SimBase only employs an element-wise product instead of intricate multimodal fusion. Remarkably, SimBase achieves state-of-the-art results on two large-scale datasets. As a simple yet powerful baseline, we hope SimBase will spark new ideas and streamline future evaluations in temporal video grounding.

Paper Structure

This paper contains 13 sections, 6 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Examples of temporal video grounding. The goal of temporal video grounding is to localize the temporal moment in an untrimmed video based on a given sentence query.
  • Figure 2: Overview of Simple Baseline (SimBase) for temporal video grounding. SimBase leverages lightweight, one-dimensional temporal convolutional layers instead of complex temporal structures. For cross-modal interaction, SimBase uses a Hadamard (element-wise) product rather than a complex interaction between language and video.
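
The two ingredients named in the abstract and in Figure 2, an element-wise (Hadamard) product for cross-modal fusion and a one-dimensional convolution over time, can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the function names (`hadamard_fuse`, `temporal_conv1d`), the order of the two operations, the zero padding, and the toy feature sizes and weights are all assumptions made for illustration, and the summary above does not specify layer counts, kernel sizes, or the prediction head.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]  # shape (T, D): one feature vector per video clip


def hadamard_fuse(video: Sequence, sentence: Vector) -> Sequence:
    """Multiply every clip feature element-wise by the sentence feature.

    Assumes video and sentence features already share the same dimension D.
    """
    return [[v * s for v, s in zip(clip, sentence)] for clip in video]


def temporal_conv1d(x: Sequence, weight: List[List[Vector]], bias: Vector) -> Sequence:
    """1D convolution along the time axis with zero padding ("same" length).

    weight has shape (out_channels, in_channels, kernel_size); kernel_size
    is assumed to be odd so the window is centred on each time step.
    """
    num_steps, in_channels = len(x), len(x[0])
    kernel_size = len(weight[0][0])
    half = kernel_size // 2
    out: Sequence = []
    for t in range(num_steps):
        row: Vector = []
        for o, b in enumerate(bias):
            acc = b
            for k in range(kernel_size):
                src = t + k - half
                if 0 <= src < num_steps:  # positions outside the video act as zeros
                    for c in range(in_channels):
                        acc += weight[o][c][k] * x[src][c]
            row.append(acc)
        out.append(row)
    return out


# Toy inputs (hypothetical): 3 clips with 2-dimensional features.
video = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sentence = [2.0, 0.5]

# Cross-modal interaction: broadcast the sentence vector over all clips.
fused = hadamard_fuse(video, sentence)

# Temporal modelling: one output channel, kernel size 3, all-ones weights.
weight = [[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
bias = [0.0]
scores = temporal_conv1d(fused, weight, bias)

print(fused)   # [[2.0, 1.0], [6.0, 2.0], [10.0, 3.0]]
print(scores)  # [[11.0], [24.0], [21.0]]
```

In a real model the weights would be learned and the layers would typically be written with a deep learning framework; the explicit loops here only make the arithmetic visible.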